Count all Words

This notebook shows how to create a custom list of objects from the TextGrid Repository matching specific search criteria, and how to retrieve the text content of the objects found.

In the first example we use TG-search to get a list of all XML objects in the TextGrid Repository. We retrieve all these objects as plaintext from the Aggregator and count the words with a simple regular expression.
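Counting words with `\w+` boils down to a single regular-expression call; a minimal sketch of what the first example does per document:

```python
import re

def count_words(text: str) -> int:
    # count runs of one or more word characters (letters, digits, underscore)
    return len(re.findall(r'\w+', text))

print(count_words('Faust: Der Tragödie erster Teil'))
```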

A second example uses the Aggregator service to do the counting by applying a custom XSLT. This reduces traffic and puts the load where it belongs: server-side.

In both cases, counting the whole repository takes hours. To speed up the process, we carry over the number of objects and words counted up to a start_date and add only the newly created objects. This also demonstrates a more advanced TG-search filter. In the first cell, we set parameters valid for both examples.

(Uses tqdm for the progress bar, which needs to be installed in Jupyter's environment; just comment out the pbar-related lines if tqdm is not available.)

Let's start by setting a few parameters used in both examples.

from IPython.display import Markdown, display
###
# we keep track of previously counted words and only add new ones,
# so we do not have to download 118019 XML documents again
###
start_date = '2025-07-01'
end_date = '2025-09-01'
# all words until start-date - we already counted until that date ;-)
words_until_start_date = 356664820
objects_until_start_date = 119011
print(f'''
From last count we know that until {start_date} there were {objects_until_start_date:,} 
xml objects with {words_until_start_date:,} words in the TextGrid repository. 
Now we count how many were added until {end_date}.
''')

# query objects added between start_date and end_date, only text/xml
query = 'lastModified:{'+start_date+' TO '+end_date+'}'
filters = ['format:text/xml']

display(Markdown('Browse yourself: <https://textgridrep.org/search?query=lastModified:{'+start_date+'+TO+'+end_date+'}&filter=format:text/xml>'))

limit = 50
From last count we know that until 2025-07-01 there were 119,011 
xml objects with 356,664,820 words in the TextGrid repository. 
Now we count how many were added until 2025-09-01.

Browse yourself: https://textgridrep.org/search?query=lastModified:{2025-07-01+TO+2025-09-01}&filter=format:text/xml
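Note the curly braces in the query: in Lucene/Elasticsearch range syntax they make both bounds exclusive, while square brackets would make them inclusive. The same string can also be built with an f-string:

```python
start_date = '2025-07-01'
end_date = '2025-09-01'

# exclusive range (curly braces): strictly between the two dates
query = f'lastModified:{{{start_date} TO {end_date}}}'
# inclusive variant (square brackets), for comparison
query_inclusive = f'lastModified:[{start_date} TO {end_date}]'
print(query)
```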

import re
from tqdm.notebook import tqdm
from tgclients import TextgridSearch, TextgridSearchRequest, Aggregator

# we use tgsearch to find xml objects, aggregator to get plaintext
tgsearch = TextgridSearch()
aggregator = Aggregator()

all_words = 0

# get the first set of results
results = tgsearch.search(query=query, filters=filters, limit=limit)

total = int(results.hits)
print(f'tgsearch response: {total:,} XML objects were added between {start_date} and {end_date}.')
print('Downloading and counting now…')

# iterate the search results and query again (paging) until there are no hits left
with tqdm(total=total) as pbar:
    while True:
        for result in results.result:
            uri = result.object_value.generic.generated.textgrid_uri.value
    
            # extract plaintext from tei via aggregator
            fulltext = aggregator.text(uri).text

            # count all occurrences of one or more alphanumeric characters in a row
            num_words = len(re.findall(r'\w+', fulltext))
    
            all_words += num_words
            pbar.set_description(f'words: {all_words:,}')
            pbar.update(1)

        # on the last page there is no "next" token, so stop; otherwise trigger the next search request
        if results.next is None:
            break
        # get next result set from tgsearch
        results = tgsearch.search(query=query, filters=filters, start=results.next, limit=limit)

# tell what we found out
print(f'Found {all_words:,} words in {total:,} XML documents since {start_date},')
print(f'''which sums up to {all_words + words_until_start_date:,} words in 
{total+objects_until_start_date:,} XML documents in the whole repository as of {end_date}''')
tgsearch response: 169 XML objects were added between 2025-07-01 and 2025-09-01.
Downloading and counting now…
Found 102,172 words in 169 XML documents since 2025-07-01,
which sums up to 356,766,992 words in 
119,180 XML documents in the whole repository as of 2025-09-01
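The fetch-process-follow-next loop above is a generic paging pattern and can be factored into a reusable generator. This is a sketch against a hypothetical `fetch` callable, not the actual tgclients signature:

```python
def iter_results(fetch, limit=50):
    """Yield individual hits page by page.

    `fetch(start, limit)` is assumed to return an object with a `.result`
    list (the hits on this page) and a `.next` attribute holding the offset
    of the next page, or None on the last page.
    """
    results = fetch(start=0, limit=limit)
    while True:
        yield from results.result
        if results.next is None:
            break
        results = fetch(start=results.next, limit=limit)
```

With such a generator, the word counting in the example reduces to a single loop over `iter_results(...)` without explicit paging logic.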

Efficient Usage of TextGrid Services

We can save traffic by letting the server count the tokens: a very simple XSLT applied to each document returns its number of tokens, so we only have to sum up a few integers. Here we go.

First we inspect the applied XSLT to get an idea of what is going on.

from tgclients import TextgridCrud
crud = TextgridCrud()

# read the XML for https://textgridrep.org/textgrid:4228q
xslt_file = crud.read_data('textgrid:4228q').text
print(xslt_file)
<?xml version="1.0" encoding="UTF-8"?>
<stylesheet xmlns="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    version="2.0">
    <output method="text"/>
    <template match="/">
        <!-- 
        making everything below tei:text a single string;
        tokenize that string (we could also add a regular expression,
        if we do not agree with the default one: ``);
        count the items in the sequence created before.
        -->
        <value-of select="count(tokenize((string-join(//tei:TEI/tei:text, ' ')), '\s+'))"/>
    </template>
</stylesheet>
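The XPath `tokenize(..., '\s+')` call splits the joined text on runs of whitespace. Roughly the same count can be reproduced locally in Python — an approximation, since XPath `tokenize` differs in edge cases such as leading separators:

```python
import re

def count_tokens(text: str) -> int:
    # split on runs of whitespace and drop empty strings
    return len([t for t in re.split(r'\s+', text) if t])

print(count_tokens('Habe nun, ach! Philosophie durchaus studiert'))
```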

Let's apply this to a single document.

from tgclients import Aggregator
aggregator = Aggregator()

aggregator.render(textgrid_uris='textgrid:vqmz.0', stylesheet_uri='textgrid:4228q').text
'37436'
###
# reset those parameters
###
all_words = 0

# get the first set of results
results = tgsearch.search(query=query, filters=filters, limit=limit)
total = int(results.hits)
print(f'tgsearch response: {total:,} new XML objects were added between {start_date} and {end_date}.')

print('Not downloading, but counting now…')

# iterate the search results and query again (paging) until there are no hits left
with tqdm(total=total) as pbar:
    while True:

        for result in results.result:
            uri = result.object_value.generic.generated.textgrid_uri.value
    
            # get number of tokens from the aggregator
            number_str = aggregator.render(textgrid_uris=uri, stylesheet_uri='textgrid:4228q').text

            # the aggregator already counted; parse the returned string
            num_words = int(number_str)
    
            all_words += num_words
            pbar.set_description(f'words: {all_words:,}')
            pbar.update(1)
    
        if results.next is None:
            break
        # get next result set from tgsearch
        results = tgsearch.search(query=query, filters=filters, start=results.next, limit=limit)

# tell what we found out
print(f'Found {all_words:,} words in {total:,} XML documents since {start_date},')
print(f'''which sums up to {all_words + words_until_start_date:,} words in 
{total+objects_until_start_date:,} XML documents in the whole repository as of {end_date}.''')
tgsearch response: 169 new XML objects were added between 2025-07-01 and 2025-09-01.
Not downloading, but counting now…
Found 42,412 words in 169 XML documents since 2025-07-01,
which sums up to 356,707,232 words in 
119,180 XML documents in the whole repository as of 2025-09-01.

The result differs from the first example because the counting methods differ: the XSLT tokenizes only the content below tei:text on whitespace, while the Python code matched \w+ over the full plaintext. Feel free to refine and adjust to your needs; XSLT files can be published to the repository without further restrictions.
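A one-line illustration of how the two regular expressions diverge: `\w+` breaks words at internal punctuation, whereas whitespace tokenization keeps them together:

```python
import re

text = "don't stop"
print(len(re.findall(r'\w+', text)))  # tokens: "don", "t", "stop"
print(len(re.split(r'\s+', text)))    # tokens: "don't", "stop"
```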