Count all Words

This notebook shows how to create your own list of objects from the TextGrid Repository matching specific search criteria, and how to retrieve the text content of the found objects.

In the first example we use TG-search to get a list of all XML objects in the TextGrid Repository. We retrieve all these objects as plaintext from the Aggregator and count the words with a simple regexp.

The second example uses the Aggregator service to do the counting by applying a custom XSLT. This reduces traffic and puts the load where it belongs: server-side.

For both cases a complete run over the whole repository takes several hours. To speed up the process, we carry over the number of objects and words counted up to a start_date and only add the newly created objects on top. This also demonstrates the use of a more advanced filter in TG-search. In the first cell, we set parameters valid for both examples.

(The examples use tqdm for the progress bar, which needs to be installed in Jupyter's environment; just comment out the pbar-related lines if tqdm is not available.)
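If tqdm might be missing and you do not want to touch the loop code, a guarded import with a do-nothing stand-in also works. This is just a sketch; the fallback class is a hypothetical helper, not part of tqdm:

# optional: fall back to a no-op progress bar if tqdm is not installed
try:
    from tqdm.notebook import tqdm
except ImportError:
    class tqdm:  # hypothetical no-op stand-in for the few methods used below
        def __init__(self, total=None): pass
        def __enter__(self): return self
        def __exit__(self, *exc): return False
        def update(self, n=1): pass
        def set_description(self, desc): pass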

Let's start by setting a few parameters we use in both examples.

###
# we keep track of previously counted words and just add the new ones,
# so we do not have to download 112461 xml documents again
###
start_date = '2023-10-01'
# all words until start-date - we already counted until that date ;-)
words_until_start_date = 254706269
objects_until_start_date = 112461
print(f'''
From last count we know that until {start_date} there were {objects_until_start_date:,} 
xml objects with {words_until_start_date:,} words in the TextGrid repository.
''')

# query objects newer than start_date, only text/xml
query = 'lastModified:{'+start_date+' TO *}'
filters = ['format:text/xml']

###
# start is the pointer which gets incremented, starting with 0
# limit is the number of search results to retrieve at once
###
start = 0
limit = 100
From last count we know that until 2023-10-01 there were 112,461 
xml objects with 254,706,269 words in the TextGrid repository.
import re
from tqdm.notebook import tqdm
from tgclients import TextgridSearch, TextgridSearchRequest, Aggregator

# we use tgsearch to find xml objects, aggregator to get plaintext
tgsearch = TextgridSearch()
aggregator = Aggregator()

# helpers for the iteration
nextpage = True
all_words = 0

# first search to get total hits for progress bar
result = tgsearch.search(query=query, filters=filters, start=0, limit=0)

total = int(result.hits)
print(f'tgsearch response: {total:,} new XML objects were added from {start_date} until today.')
print('Downloading and counting now…')

# iterate the search results and query again (paging) until there are no hits left
with tqdm(total=total) as pbar:
    while nextpage:
    
        # get next result set from tgsearch
        results = tgsearch.search(query=query, filters=filters, start=start, limit=limit)
        
        for result in results.result:
            uri = result.object_value.generic.generated.textgrid_uri.value
    
            # extract plaintext from tei via aggregator
            fulltext = aggregator.text(uri).text

            # count all occurrences of one or more word characters in a row
            num_words = len(re.findall(r'\w+', fulltext))
    
            all_words += num_words
            pbar.set_description(f'words: {all_words:,}')
            pbar.update(1)
    
        # increment the start counter for the next run
        start = start + limit
        if start > int(results.hits):
            # stop if there are no more results left
            nextpage = False

# tell what we found out
print(f'Found {all_words:,} words in {total:,} XML documents since {start_date},')
print(f'''which sums up to {all_words + words_until_start_date:,} words in 
{total+objects_until_start_date:,} XML documents in the whole repository today''')

tgsearch response: 3 new XML objects were added from 2023-10-01 until today.
Downloading and counting now…
Found 271 words in 3 XML documents since 2023-10-01,

which sums up to 254,706,540 words in 
112,464 XML documents in the whole repository today

Efficient Usage of TextGrid Services

We can save some traffic by letting the server do the counting: a very simple XSLT is applied to each document and returns its number of tokens, so we only have to sum up a few integers. Here we go.

First we inspect the XSLT to be applied, to get an idea of what is going on.

from tgclients import TextgridCrud
crud = TextgridCrud()

# read the XSLT stylesheet textgrid:4228q that does the token counting
xslt_file = crud.read_data('textgrid:4228q').text
print(xslt_file)
<?xml version="1.0" encoding="UTF-8"?>
<stylesheet xmlns="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    version="2.0">
    <output method="text"/>
    <template match="/">
        <!-- 
        making everything below tei:text a single string;
        tokenize that string (we could also add a regular expression,
        if we do not agree with the default one: ``);
        count the items in the sequence created before.
        -->
        <value-of select="count(tokenize((string-join(//tei:TEI/tei:text, ' ')), '\s+'))"/>
    </template>
</stylesheet>

Let's apply this to a single document.

from tgclients import Aggregator
aggregator = Aggregator()

aggregator.render(textgrid_uris='textgrid:vqmz.0', stylesheet_uri='textgrid:4228q').text
'37436'
###
# reset those parameters
###
start = 0
nextpage = True
all_words = 0

result = tgsearch.search(query=query, filters=filters, start=0, limit=0)
total = int(result.hits)
print(f'tgsearch response: {total:,} new XML objects were added from {start_date} until today.')

print('Not downloading, but counting now…')

# iterate the search results and query again (paging) until there are no hits left
with tqdm(total=total) as pbar:
    while nextpage:
    
        # get next result set from tgsearch
        results = tgsearch.search(query=query, filters=filters, start=start, limit=limit)
        
        for result in results.result:
            uri = result.object_value.generic.generated.textgrid_uri.value
    
            # get number of tokens from the aggregator
            number_str = aggregator.render(textgrid_uris=uri, stylesheet_uri='textgrid:4228q').text

            # the XSLT already did the counting, we just convert the returned string to an int
            num_words = int(number_str)
    
            all_words += num_words
            pbar.set_description(f'words: {all_words:,}')
            pbar.update(1)
    
        # increment the start counter for the next run
        start = start + limit
        if start > int(results.hits):
            # stop if there are no more results left
            nextpage = False

# tell what we found out
print(f'Found {all_words:,} words in {total:,} XML documents since {start_date},')
print(f'''which sums up to {all_words + words_until_start_date:,} words in 
{total+objects_until_start_date:,} XML documents in the whole repository today.''')
tgsearch response: 3 new XML objects were added from 2023-10-01 until today.
Not downloading, but counting now…
Found 225 words in 3 XML documents since 2023-10-01,

which sums up to 254,706,494 words in 
112,464 XML documents in the whole repository today

The result does not exactly match the count from the first example because different tokenization rules are used: the client-side code counts runs of word characters (\w+), while the XSLT splits the text on whitespace (\s+). Feel free to refine and adjust them to your needs. XSLT files can be published to the repository without further restrictions.
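To illustrate why the two numbers differ, here is a small self-contained sketch (the sample sentence is made up) comparing both tokenization strategies:

import re

sample = "Geh' hinab, mein wohl-geliebtes Kind!"

# client-side strategy: every run of word characters is one token
print(len(re.findall(r'\w+', sample)))        # 6: Geh, hinab, mein, wohl, geliebtes, Kind

# server-side XSLT strategy: split the string on whitespace
print(len(re.split(r'\s+', sample.strip())))  # 5: Geh' / hinab, / mein / wohl-geliebtes / Kind!

Apostrophes and hyphens split a token in the first variant but not in the second, which explains why the XSLT-based count is slightly lower.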