Retrieve fulltext and metadata for all editions of one project
Task:
We want to analyse the fulltext content of all editions for one project.
We want to have access to the edition metadata.
Because there may be many editions, we split the job by paging through the search results.
from tgclients import TextgridSearch, TextgridSearchRequest, Aggregator, TextgridConfig
from tgclients.config import DEV_SERVER
###
# prepare textgrid clients which are configured to use the dev instance,
# because aggregating fulltext of editions may put high load on the aggregator
###
config = TextgridConfig(DEV_SERVER)
tgsearch = TextgridSearch(config)
aggregator = Aggregator(config)
###
# choose a project ID, look at https://dev.textgridrep.org/projects for inspiration
###
#project_id = 'TGPR-1789bd93-99c5-58e4-e100-619e27ec1119' # Keine Wahlwerbung (1 Ed.)
project_id = 'TGPR-f3e628ae-74b8-2ebb-9fee-614c59c9b522' # Distant Reading – 2021-09-23 (8 Ed.)
#project_id = 'TGPR-ca80b39a-5487-27ee-1289-6294a25f975a' # Goethes Farbenlehre (51 Ed.)
#project_id = 'TGPR-44684af6-1d30-b6d0-3665-62a87b5380b7' # CoNSSA (219 Ed.)
#project_id = 'TGPR-372fe6dc-57f2-6cd4-01b5-2c4bbefcfd3c' # Digitale Bibliothek (93462 Ed.)
###
# start is the paging pointer: None requests the first page, afterwards it is set to results.next
# limit is the number of search results to retrieve at once
###
start = None
limit = 10
while True:
    ###
    # filter for all editions in the chosen project
    ###
    results = tgsearch.search(
        filters=[
            'project.id:' + project_id,
            'format:text/tg.edition+tg.aggregation+xml'],
        start=start, limit=limit)
    if results.hits:
        print(f'found: {results.hits}')
    for result in results.result:
        edition_uri = result.object_value.generic.generated.textgrid_uri.value
        edition_agent = result.object_value.edition.agent[0].value
        edition_title = result.object_value.generic.provided.title[0]
        # edition metadata
        print(edition_agent + ' - ' + edition_title + '\n')
        ###
        # aggregate all text content of all children of this edition as plaintext
        ###
        fulltext = aggregator.text(edition_uri).text
        print(fulltext[0:100])  # just the first 100 chars to not pollute this notebook
        print("---\n")
    ###
    # if the results carry a next attribute, continue with the next page,
    # otherwise stop
    ###
    if results.next is not None:
        start = results.next
    else:
        break
print('\n+------+\n| DONE |\n+------+')
found: 8
- The Mayor of Casterbridge: The Life and Death of a Man of Character
THE MAYOR OF CASTERBRIDGE
by Thomas Hardy
1.
One evening of late summer, before the nineteenth centu
---
Carroll, Lewis, 1832-1898 - Alice's Adventures in Wonderland
ALICE’S ADVENTURES IN WONDERLAND
By Lewis Carroll
THE MILLENNIUM FULCRUM EDITION 3.0
CHAPTER I. Down
---
Frances Trollope - The Life and Adventures of Michael Armstrong
PREFACE.
When the author of "Michael Armstrong" first determined on attempting to draw the attention
---
Nesbit, E. (Edith), 1858-1924 - The Red House
THE RED HOUSE A Novel
BY E. NESBIT AUTHOR OF “THE TREASURE “THE WOULDBEGOODS,”
ILLUSTRATED BY A.I. K
---
Fontane, Theodor - Unterm Birnbaum
Erstes Kapitel
Vor dem in dem großen und reichen Oderbruchdorfe Tschechin um Michaeli 20 eröffneten
---
Otto, Louise - Nürnberg. Zweiter Band
Erstes Capitel
Gobelins
Die kalten Strahlen einer halbverschleierten Wintersonne brachen sich auf de
---
Christ, Lena - Mathias Bichler
Im Weidhof
Meine Kostmutter hat mir gesagt, daß ich am vierten Sonntag nach der Erscheinung des Herr
---
Willkomm, Ernst Adolf - Weisse Sclaven oder die Leiden des Volkes
Erster Theil
Erstes Buch
Erstes Kapitel.
Der Haidekretscham.
Ein ansehnlicher Theil der beiden Lausi
---
+------+
| DONE |
+------+
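The paging loop above can be factored into a small generator, so that the "fetch a page, follow next, stop when next is missing" logic lives in one place. The following is only a sketch: `iter_pages`, `fake_fetch` and the fake pages are hypothetical names invented for illustration, and the stand-in objects merely mimic the `result` and `next` attributes used above so that the example runs without network access.

```python
from types import SimpleNamespace
from typing import Callable, Iterator, Optional


def iter_pages(fetch_page: Callable[[Optional[str]], object]) -> Iterator[object]:
    """Yield result pages until a page carries no `next` pointer."""
    start = None  # None requests the first page
    while True:
        page = fetch_page(start)
        yield page
        start = getattr(page, 'next', None)
        if start is None:
            break


# Stand-in for a search call (assumption, not the real client):
# three fake pages that expose `result` and `next` like the loop above expects.
_FAKE_PAGES = {
    None: SimpleNamespace(result=['a', 'b'], next='2'),
    '2': SimpleNamespace(result=['c', 'd'], next='4'),
    '4': SimpleNamespace(result=['e'], next=None),
}


def fake_fetch(start):
    """Return the fake page that belongs to the given start pointer."""
    return _FAKE_PAGES[start]


all_results = [item for page in iter_pages(fake_fetch) for item in page.result]
print(all_results)
```

With the clients from this notebook, one would presumably pass something like `lambda start: tgsearch.search(filters=[...], start=start, limit=limit)` instead of `fake_fetch`.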
Retrieve just fulltext without edition metadata as plaintext
We just want the plaintext of all XML objects in a project.
from tgclients import TextgridSearch, TextgridSearchRequest, Aggregator, TextgridConfig
from tgclients.config import DEV_SERVER
###
# prepare textgrid clients which are configured to use the dev instance,
# because aggregating fulltext of editions may put high load on the aggregator
###
config = TextgridConfig(DEV_SERVER)
tgsearch = TextgridSearch(config)
aggregator = Aggregator(config)
###
# choose a project ID, look at https://dev.textgridrep.org/projects for inspiration
###
#project_id = 'TGPR-1789bd93-99c5-58e4-e100-619e27ec1119' # Keine Wahlwerbung (1 Ed.)
project_id = 'TGPR-f3e628ae-74b8-2ebb-9fee-614c59c9b522' # Distant Reading – 2021-09-23 (8 Ed.)
#project_id = 'TGPR-ca80b39a-5487-27ee-1289-6294a25f975a' # Goethes Farbenlehre (51 Ed.)
#project_id = 'TGPR-44684af6-1d30-b6d0-3665-62a87b5380b7' # CoNSSA (219 Ed.)
#project_id = 'TGPR-372fe6dc-57f2-6cd4-01b5-2c4bbefcfd3c' # Digitale Bibliothek (93462 Ed.)
###
# start is the paging pointer: None requests the first page, afterwards it is set to results.next
# limit is the number of search results to retrieve at once
###
start = None
limit = 10
while True:
    ###
    # filter for all XML objects in the chosen project
    ###
    results = tgsearch.search(
        filters=[
            'project.id:' + project_id,
            'format:text/xml'],
        start=start, limit=limit)
    if results.hits:
        print(f'found: {results.hits}')
    for result in results.result:
        uri = result.object_value.generic.generated.textgrid_uri.value
        title = result.object_value.generic.provided.title[0]
        # object title
        print(title + '\n')
        ###
        # use the aggregator to extract plaintext from the XML object
        ###
        fulltext = aggregator.text(uri).text
        print(fulltext[0:100])  # just the first 100 chars to not pollute this notebook
        print("---\n")
    ###
    # if the results carry a next attribute, continue with the next page,
    # otherwise stop
    ###
    if results.next is not None:
        start = results.next
    else:
        break
print('\n+------+\n| DONE |\n+------+')
found: 8
Alice's Adventures in Wonderland, by Lewis Carroll
ALICE’S ADVENTURES IN WONDERLAND
By Lewis Carroll
THE MILLENNIUM FULCRUM EDITION 3.0
CHAPTER I. Down
---
The Red House : VWWP edition
THE RED HOUSE A Novel
BY E. NESBIT AUTHOR OF “THE TREASURE “THE WOULDBEGOODS,”
ILLUSTRATED BY A.I. K
---
The Project Gutenberg EBook of The Mayor of Casterbridge, by Thomas Hardy
THE MAYOR OF CASTERBRIDGE
by Thomas Hardy
1.
One evening of late summer, before the nineteenth centu
---
The Life and Adventures of Michael Armstrong, the Factory Boy
PREFACE.
When the author of "Michael Armstrong" first determined on attempting to draw the attention
---
Mathias Bichler
Im Weidhof
Meine Kostmutter hat mir gesagt, daß ich am vierten Sonntag nach der Erscheinung des Herr
---
Weisse Sclaven oder die Leiden des Volkes
Erster Theil
Erstes Buch
Erstes Kapitel.
Der Haidekretscham.
Ein ansehnlicher Theil der beiden Lausi
---
Nürnberg. Zweiter Band
Erstes Capitel
Gobelins
Die kalten Strahlen einer halbverschleierten Wintersonne brachen sich auf de
---
Unterm Birnbaum
Erstes Kapitel
Vor dem in dem großen und reichen Oderbruchdorfe Tschechin um Michaeli 20 eröffneten
---
+------+
| DONE |
+------+
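Once the plaintext is available as a Python string, a first analysis step could be a simple word count. The sketch below uses only the standard library; `top_words` is a hypothetical helper and the sample sentence is made up for illustration, it is not taken from any of the texts above.

```python
import re
from collections import Counter


def top_words(text: str, n: int = 5) -> list:
    """Return the n most frequent lowercase word tokens in text."""
    # \w+ matches runs of letters, digits and underscores (Unicode-aware for str)
    tokens = re.findall(r'\w+', text.lower())
    return Counter(tokens).most_common(n)


# made-up sample text, standing in for a fulltext string
sample = 'the cat and the dog and the bird'
print(top_words(sample, 2))
# → [('the', 3), ('and', 2)]
```

In the loop above, the same function could be applied to `fulltext` for each object.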
Download all XML objects of a project
Just download all raw XML objects of one project.
from tgclients import TextgridSearch, TextgridSearchRequest, TextgridCrud, TextgridConfig
from tgclients.config import DEV_SERVER
###
# prepare textgrid clients which are configured to use the dev instance,
# so that downloading many objects does not put load on the production instance
###
config = TextgridConfig(DEV_SERVER)
tgsearch = TextgridSearch(config)
tgcrud = TextgridCrud(config)
###
# choose a project ID, look at https://dev.textgridrep.org/projects for inspiration
###
#project_id = 'TGPR-1789bd93-99c5-58e4-e100-619e27ec1119' # Keine Wahlwerbung (1 Ed.)
project_id = 'TGPR-f3e628ae-74b8-2ebb-9fee-614c59c9b522' # Distant Reading – 2021-09-23 (8 Ed.)
#project_id = 'TGPR-ca80b39a-5487-27ee-1289-6294a25f975a' # Goethes Farbenlehre (51 Ed.)
#project_id = 'TGPR-44684af6-1d30-b6d0-3665-62a87b5380b7' # CoNSSA (219 Ed.)
#project_id = 'TGPR-372fe6dc-57f2-6cd4-01b5-2c4bbefcfd3c' # Digitale Bibliothek (93462 Ed.)
###
# start is the paging pointer: None requests the first page, afterwards it is set to results.next
# limit is the number of search results to retrieve at once
###
start = None
limit = 10
while True:
    ###
    # filter for all XML objects in the chosen project
    ###
    results = tgsearch.search(
        filters=[
            'project.id:' + project_id,
            'format:text/xml'],
        start=start, limit=limit)
    if results.hits:
        print(f'found: {results.hits}')
    for result in results.result:
        uri = result.object_value.generic.generated.textgrid_uri.value
        title = result.object_value.generic.provided.title[0]
        # object title
        print(title + '\n')
        ###
        # read the raw XML data of the object with tgcrud
        ###
        data = tgcrud.read_data(uri).text
        print(data[0:100])  # just the first 100 chars to not pollute this notebook
        print("---\n")
    ###
    # if the results carry a next attribute, continue with the next page,
    # otherwise stop
    ###
    if results.next is not None:
        start = results.next
    else:
        break
print('\n+------+\n| DONE |\n+------+')
found: 8
Alice's Adventures in Wonderland, by Lewis Carroll
<?xml version='1.0' encoding='UTF-8'?><TEI xmlns:jxb="http://java.sun.com/xml/ns/jaxb" xmlns:tei="ht
---
The Red House : VWWP edition
<?xml version='1.0' encoding='UTF-8'?><TEI xmlns:jxb="http://java.sun.com/xml/ns/jaxb" xmlns:tei="ht
---
The Project Gutenberg EBook of The Mayor of Casterbridge, by Thomas Hardy
<?xml version='1.0' encoding='UTF-8'?><TEI xmlns:jxb="http://java.sun.com/xml/ns/jaxb" xmlns:tei="ht
---
The Life and Adventures of Michael Armstrong, the Factory Boy
<?xml version='1.0' encoding='UTF-8'?><TEI xmlns:jxb="http://java.sun.com/xml/ns/jaxb" xmlns:tei="ht
---
Mathias Bichler
<?xml version='1.0' encoding='UTF-8'?><TEI xmlns:jxb="http://java.sun.com/xml/ns/jaxb" xmlns:tei="ht
---
Weisse Sclaven oder die Leiden des Volkes
<?xml version='1.0' encoding='UTF-8'?><TEI xmlns:jxb="http://java.sun.com/xml/ns/jaxb" xmlns:tei="ht
---
Nürnberg. Zweiter Band
<?xml version='1.0' encoding='UTF-8'?><TEI xmlns:jxb="http://java.sun.com/xml/ns/jaxb" xmlns:tei="ht
---
Unterm Birnbaum
<?xml version='1.0' encoding='UTF-8'?><TEI xmlns:jxb="http://java.sun.com/xml/ns/jaxb" xmlns:tei="ht
---
+------+
| DONE |
+------+
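The downloaded strings are raw XML, so they can be handed to an XML parser for further processing. The following sketch counts elements by local name with the standard library's `xml.etree.ElementTree`. The helper names `local_name` and `element_counts` are hypothetical, and the tiny TEI-like sample is made up; the real documents are far larger and their exact structure is not shown in the truncated output above.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a leading '{namespace}' from an ElementTree tag name."""
    return tag.rsplit('}', 1)[-1]


def element_counts(xml_text: str) -> dict:
    """Count how often each element (by local name) occurs in the document."""
    # encode to bytes so the parser can honour the XML declaration's encoding
    root = ET.fromstring(xml_text.encode('utf-8'))
    counts = {}
    for element in root.iter():
        name = local_name(element.tag)
        counts[name] = counts.get(name, 0) + 1
    return counts


# made-up miniature document, standing in for a downloaded XML object
sample = (
    "<?xml version='1.0' encoding='UTF-8'?>"
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    '<text><body><p>one</p><p>two</p></body></text>'
    '</TEI>'
)
counts = element_counts(sample)
print(counts)
# → {'TEI': 1, 'text': 1, 'body': 1, 'p': 2}
```

In the loop above, `element_counts(data)` would give a quick overview of the markup used in each object.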