bacalhau.text.Text

class bacalhau.text.Text(text_id, content, tokenizer, stopwords)

Bases: object

Represents a text unit from a bacalhau.document.Document.

Creates a new Text object.

Parameters:	text_id (`str`) – id of the `Text`.
	content (`str`) – content of the `Text`.
	tokenizer (`nltk.tokenize.api.TokenizerI`) – tokenizer used to tokenize the files in the corpus.
	stopwords (`list` of words) – words to be removed from the texts.

_is_valid_token(token)

Checks if the token is suitable for processing. A token is suitable if it is not in the list of stopwords, it is composed only of alphabetical characters, and it is considered a noun by WordNet.

Parameters:	token (`str`) – the token to validate.
Returns:	True if `token` is valid.
Return type:	`bool`
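
The three conditions above can be illustrated with a minimal sketch. This is not the library's implementation: the standalone function `is_valid_token`, its `is_noun` parameter, and the toy noun set are hypothetical names introduced here so the example runs without WordNet data. With NLTK installed, the noun check could plausibly be written as `bool(wordnet.synsets(token, pos=wordnet.NOUN))`, but whether bacalhau does it exactly that way is an assumption.

```python
def is_valid_token(token, stopwords, is_noun):
    """Return True if `token` passes all three checks described above.

    `is_noun` is a hypothetical callable standing in for the WordNet
    lookup; it takes a token and returns a bool.
    """
    return (
        token not in stopwords   # not a stopword (case-sensitive here)
        and token.isalpha()      # alphabetical characters only
        and is_noun(token)       # considered a noun
    )


# Toy stand-in for WordNet (assumption, for illustration only).
KNOWN_NOUNS = {"fish", "salt", "recipe"}


def toy_is_noun(token):
    return token in KNOWN_NOUNS


stopwords = ["the", "of"]
valid = [t for t in ["the", "fish", "fish2", "quickly", "salt"]
         if is_valid_token(t, stopwords, toy_is_noun)]
```

In this sketch "the" is rejected as a stopword, "fish2" because it contains a digit, and "quickly" because the toy lookup does not list it as a noun.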

get_term_data()

Returns term data for this text.

For each term, the data are the unnormalised and normalised frequency counts of that term in this text. The former is stored under the “count” key, the latter under “frequency”.

The data are structured as a nested dictionary (term -> text -> counts) so that term data from multiple Texts can be merged easily.

Return type:	`dict`
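
To show why the term -> text -> counts nesting makes merging easy, here is a minimal sketch of the shape described above. The helper names `build_term_data` and `merge_term_data` are hypothetical and not part of bacalhau, and the normalisation used here (count divided by the total number of tokens in the text) is an assumption, since the documentation does not say how the frequency is normalised.

```python
from collections import Counter


def build_term_data(text_id, tokens):
    """Build {term: {text_id: {"count": n, "frequency": f}}} for one text.

    Assumption: "frequency" is the count divided by the total number of
    tokens in the text.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {
        term: {text_id: {"count": n, "frequency": n / total}}
        for term, n in counts.items()
    }


def merge_term_data(merged, new):
    """Merge one text's term data into an accumulated dictionary."""
    for term, per_text in new.items():
        # Each term maps text ids to counts, so merging is a dict update.
        merged.setdefault(term, {}).update(per_text)
    return merged


first = build_term_data("t1", ["fish", "salt", "fish"])
second = build_term_data("t2", ["fish"])

corpus = {}
merge_term_data(corpus, first)
merge_term_data(corpus, second)
```

Because the text id is the second level of the dictionary, data from different Texts never collide: after the merge, `corpus["fish"]` holds one entry for "t1" and one for "t2".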
