bacalhau.text.Text

class bacalhau.text.Text(text_id, content, tokenizer, stopwords)
    Bases: object

    Represents a text unit from a bacalhau.document.Document.

    Creates a new Text object.
_is_valid_token(token)
    Checks if the token is suitable for processing. A token is suitable if it is not in the list of stopwords, it is composed of alphabetical characters, and it is considered a noun by WordNet.

    Parameters: token (str) – the token to validate.
    Returns: True if token is valid.
    Return type: bool
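The three checks above can be sketched as a standalone function. This is a minimal illustration, not the library's implementation: the function name `is_valid_token`, its `stopwords` and `is_noun` parameters, and the stand-in noun set are all hypothetical. The noun lookup is passed in as a callable so the sketch runs without WordNet data installed.

```python
def is_valid_token(token, stopwords, is_noun):
    """Return True if the token passes all three checks described above."""
    return (
        token not in stopwords   # not a stopword
        and token.isalpha()      # letters only (str.isalpha also accepts non-ASCII letters)
        and is_noun(token)       # considered a noun by the supplied lookup
    )


# Stand-in noun lookup (assumption). With NLTK and its WordNet corpus
# installed, one option would be:
#   lambda t: bool(wordnet.synsets(t, pos=wordnet.NOUN))
known_nouns = {"fish", "market"}


def is_noun(token):
    return token in known_nouns


stopwords = {"the", "of"}
results = {
    t: is_valid_token(t, stopwords, is_noun)
    for t in ["fish", "the", "fish2", "quickly"]
}
print(results)
```

Here only "fish" passes: "the" is a stopword, "fish2" contains a digit, and "quickly" is not in the stand-in noun set.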
get_term_data()
    Returns term data for this text.

    The data for each term are its unnormalised and normalised frequency counts in this text. The former is stored under the “count” key, the latter under the “frequency” key.

    The data is structured as a nested dictionary (term -> text -> counts) so that term data from multiple Texts can be merged easily.

    Return type: dict
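To show why the nested term -> text -> counts layout merges easily, here is a minimal sketch. The helper names `term_data` and `merge_term_data` are hypothetical. The normalisation used (count divided by the total number of tokens in the text) is an assumption, since the documentation does not say how the normalised frequency is computed.

```python
from collections import Counter


def term_data(text_id, tokens):
    """Build {term: {text_id: {"count": ..., "frequency": ...}}} for one text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {
        # Assumed normalisation: share of all tokens in this text.
        term: {text_id: {"count": n, "frequency": n / total}}
        for term, n in counts.items()
    }


def merge_term_data(target, new):
    """Merge one text's term data into an accumulated dictionary."""
    for term, per_text in new.items():
        # Each text contributes its own key under the term, so nothing is overwritten
        # as long as text identifiers are unique.
        target.setdefault(term, {}).update(per_text)
    return target


merged = {}
merge_term_data(merged, term_data("t1", ["fish", "market", "fish", "boat"]))
merge_term_data(merged, term_data("t2", ["market", "boat"]))
print(merged["market"])
```

Because the text identifier is the second-level key, merging is a per-term dictionary update, and a term seen in several texts ends up with one entry per text.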