bacalhau.text.Text

class bacalhau.text.Text(text_id, content, tokenizer, stopwords)

Bases: object

Represents a text unit from a bacalhau.document.Document.

Creates a new Text object.

Parameters:
  • text_id (str) – id of the Text.
  • content (str) – content of the Text.
  • tokenizer (nltk.tokenize.api.TokenizerI) – tokenizer used to tokenize the files in the corpus.
  • stopwords (list of str) – words to be removed from the texts.

_is_valid_token(token)

Checks if the token is suitable for processing. A token is suitable if it is not in the list of stopwords, is composed of alphabetical characters, and is considered a noun by WordNet.

Parameters: token (str) – the token to validate.
Returns: True if token is valid.
Return type: bool
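
The three conditions can be sketched as follows. This is a minimal illustration, not the library's code: the names `is_valid_token` and `is_noun` are hypothetical, and the WordNet lookup is replaced by a caller-supplied predicate so that the snippet runs without NLTK or its data files.

```python
def is_valid_token(token, stopwords, is_noun):
    """Return True if the token passes all three checks.

    `is_noun` is a stand-in for the WordNet noun lookup (an assumption of
    this sketch); it takes a token and returns a truthy value for nouns.
    """
    if token in stopwords:
        return False
    if not token.isalpha():
        return False
    return bool(is_noun(token))


# Toy noun lexicon standing in for WordNet.
NOUNS = {"fish", "boat", "harbour"}
stopwords = ["the", "a", "of"]

tokens = ["the", "fish", "swims", "boat42", "harbour"]
valid = [t for t in tokens if is_valid_token(t, stopwords, NOUNS.__contains__)]
print(valid)  # ['fish', 'harbour']
```

Here "the" is rejected as a stopword, "boat42" because it is not purely alphabetical, and "swims" because the toy lexicon does not list it as a noun.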

get_term_data()

Returns term data for this text.

For each term, the data are the unnormalised and normalised frequency counts of that term in this text. The former is stored under the “count” key, the latter under “frequency”.

The data is structured as a nested dictionary (term -> text -> counts) for easy merging of the term data from multiple Texts.

Return type: dict
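
To make the nested layout concrete, here is a rough sketch of the term -> text -> counts structure and of why it merges easily. The helper names `build_term_data` and `merge_term_data` are hypothetical, and the normalisation used (count divided by the total number of tokens in the text) is an assumption; the documentation above does not say how the “frequency” value is computed.

```python
from collections import Counter


def build_term_data(text_id, tokens):
    """Build {term: {text_id: {"count": ..., "frequency": ...}}}.

    Assumption: "frequency" is the count divided by the number of tokens.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return {
        term: {text_id: {"count": count, "frequency": count / total}}
        for term, count in counts.items()
    }


def merge_term_data(*term_data_dicts):
    """Merge term data from several texts into one nested dictionary."""
    merged = {}
    for term_data in term_data_dicts:
        for term, per_text in term_data.items():
            # Each text contributes its own inner key, so nothing is overwritten
            # as long as the text ids are distinct.
            merged.setdefault(term, {}).update(per_text)
    return merged


a = build_term_data("text-1", ["fish", "boat", "fish", "net"])
b = build_term_data("text-2", ["fish", "harbour"])
merged = merge_term_data(a, b)
print(merged["fish"])
```

Because the text id is a key at the second level, combining data from several texts only requires updating the inner dictionary for each term.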