bacalhau.text.Text

class bacalhau.text.Text(text_id, content, tokenizer, stopwords)
Bases: object

Represents a text unit from a bacalhau.document.Document.

Creates a new Text object.

_is_valid_token(token)
Checks if the token is suitable for processing. A token is suitable if it is not in the list of stopwords, it is composed of alphabetic characters, and it is considered a noun by WordNet.

Parameters: token (str) – the token to validate.
Returns: True if token is valid.
Return type: bool
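
The three checks described above can be sketched as a standalone function. This is an illustration only, not the library's implementation: the function name `is_valid_token`, the `is_noun` callback, and the toy noun set are all hypothetical. The noun test is injected so the sketch runs without WordNet data installed.

```python
def is_valid_token(token, stopwords, is_noun):
    """Sketch of the documented checks (hypothetical helper, not bacalhau's code)."""
    if token in stopwords:
        # Rule 1: stopwords are never valid.
        return False
    if not token.isalpha():
        # Rule 2: the token must consist of alphabetic characters.
        return False
    # Rule 3: the token must be considered a noun.
    return bool(is_noun(token))


# Hypothetical stand-in for a WordNet noun lookup.
KNOWN_NOUNS = {"fish", "market"}
stopwords = {"the", "of"}


def toy_is_noun(token):
    return token in KNOWN_NOUNS


print(is_valid_token("fish", stopwords, toy_is_noun))
print(is_valid_token("the", stopwords, toy_is_noun))
```

If NLTK's WordNet corpus is available, one possible replacement for `toy_is_noun` is `lambda t: bool(wordnet.synsets(t, pos=wordnet.NOUN))`, with `wordnet` imported from `nltk.corpus`. Whether the class performs the lookup this way is an assumption.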

get_term_data()
Returns term data for this text.

The term’s data are the unnormalised and normalised frequency counts of the term in this text. The former uses the “count” key, the latter “frequency”.

The data is structured as a nested dictionary (term -> text -> counts) for easy merging of the term data from multiple Text objects.

Return type: dict
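
To make the nested layout concrete, here is a minimal sketch of a term -> text -> counts dictionary and of merging two of them. The helper names `term_data` and `merge_term_data` are hypothetical, and the normalisation (count divided by the number of tokens in the text) is an assumption; the documentation does not state how the "frequency" value is normalised.

```python
from collections import Counter


def term_data(text_id, tokens):
    """Build {term: {text_id: {"count": ..., "frequency": ...}}} (hypothetical helper)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        # Assumption: "frequency" is the count divided by the text's token total.
        term: {text_id: {"count": n, "frequency": n / total}}
        for term, n in counts.items()
    }


def merge_term_data(*datasets):
    """Merge per-text term data; each term collects one entry per text."""
    merged = {}
    for data in datasets:
        for term, per_text in data.items():
            merged.setdefault(term, {}).update(per_text)
    return merged


a = term_data("text-1", ["fish", "market", "fish", "salt"])
b = term_data("text-2", ["fish", "boat"])
merged = merge_term_data(a, b)
print(merged["fish"])
```

Because the text identifier is the second-level key, merging never overwrites another text's counts as long as identifiers are unique.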