bacalhau.corpus.Corpus

class bacalhau.corpus.Corpus(corpus_path, document_class, tokenizer=<WordPunctTokenizer object>, stopwords=<Mock object>, **document_kwargs)
Bases: object
A manager class to generate topic hierarchies from files.
Creates a new Corpus for the given path, using the given bacalhau.document.Document class to process the files.
Parameters:
- corpus_path (str) – path to the files.
- document_class (bacalhau.document.Document) – document class used to process the corpus files.
- tokenizer (nltk.tokenize.api.TokenizerI) – tokenizer used to tokenize the files in the corpus, defaults to nltk.tokenize.regexp.WordPunctTokenizer.
- stopwords (list) – words to be removed from the texts, defaults to nltk.corpus.stopwords.words('english').

_add_tf_idf(term_data)
Returns term_data with a TF.IDF value added to each term/text combination.
Parameters: term_data (dict) – dict with term/text combination.
Return type: dict
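To illustrate what adding a TF.IDF value to each term/text combination can look like, here is a minimal sketch. The data layout (`{term: {text_id: {"count": n}}}`), the function name `add_tf_idf_sketch`, and the weighting (raw count multiplied by `log(N / df)`) are assumptions made for this example; this page does not state the structure or formula that bacalhau itself uses.

```python
import math


def add_tf_idf_sketch(term_data, text_count):
    """Add a 'tf.idf' value to every term/text entry.

    term_data: {term: {text_id: {"count": int}}}  (assumed layout)
    text_count: total number of texts in the corpus
    """
    for term, texts in term_data.items():
        # Document frequency: number of texts in which the term occurs.
        doc_freq = len(texts)
        idf = math.log(text_count / doc_freq)
        for text_id, entry in texts.items():
            entry["tf.idf"] = entry["count"] * idf
    return term_data


term_data = {
    "whale": {"text1": {"count": 3}, "text2": {"count": 1}},
    "harpoon": {"text1": {"count": 2}},
}
result = add_tf_idf_sketch(term_data, text_count=2.0)
```

With this weighting, a term that occurs in every text ("whale" here) scores zero, while a term confined to one text keeps a positive score.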

_get_documents()
Creates a bacalhau.document.Document object for each of the files in the corpus, and returns them in a list.
Parameters: corpus_path (str) – path to the corpus files.
Returns: documents in this corpus.
Return type: list

_get_hypernym(word)
Returns a list of the hypernyms for the given word.
Parameters: word (str) – the word to get the hypernym for.
Return type: list

_get_term_data()
Returns term data for all of the bacalhau.document.Document objects in this corpus.
Return type: dict

_get_text_count()
Returns the number of bacalhau.text.Text objects in this corpus.
Return type: float

annotate_topic_tree(tree)
Annotates the nodes in the bacalhau.topic_tree.TopicTree with information about which bacalhau.text.Text objects and counts the nodes relate to.
Parameters: tree (bacalhau.topic_tree.TopicTree) – topic tree of terms.
Return type: bacalhau.topic_tree.TopicTree
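The idea of annotating nodes with the texts and counts they relate to can be sketched as follows. This is only an illustration: the real method takes a bacalhau.topic_tree.TopicTree, whereas the hypothetical `annotate_sketch` below takes a plain dict of root-to-term paths per text, an input format assumed for this example.

```python
from collections import Counter


def annotate_sketch(paths_by_text):
    """Return {node: Counter({text_id: count})} for every node on a path.

    paths_by_text: {text_id: [[root, ..., term], ...]}  (assumed input)
    """
    annotations = {}
    for text_id, paths in paths_by_text.items():
        for path in paths:
            for node in path:
                # Each path through a node counts once for that text.
                annotations.setdefault(node, Counter())[text_id] += 1
    return annotations


paths_by_text = {
    "text1": [["animal", "mammal", "whale"]],
    "text2": [["animal", "mammal", "dog"], ["animal", "bird", "gull"]],
}
annotations = annotate_sketch(paths_by_text)
```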

generate_topic_tree(n_terms)
Generates a bacalhau.topic_tree.TopicTree for the corpus, using a maximum of n_terms from each bacalhau.text.Text. First extracts top terms; second gets hypernyms for each of the terms; third creates the bacalhau.topic_tree.TopicTree using the hypernyms.
Parameters: n_terms (int) – maximum number of terms to be used from each Text.
Returns: the generated topic tree.
Return type: bacalhau.topic_tree.TopicTree

get_hypernyms(top_terms)
Returns a dictionary with the hypernyms for the given terms.
Parameters: top_terms (dict) – dict with term/text information.
Returns: {text: {term: hypernym}}.
Return type: dict
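The `{text: {term: hypernym}}` return shape can be illustrated with a small sketch. The function `hypernyms_sketch`, the `lookup` callable, and the toy table are hypothetical stand-ins; the input layout `{text: {term: score}}` is also an assumption, since this page only describes `top_terms` as a dict with term/text information.

```python
def hypernyms_sketch(top_terms, lookup):
    """Build {text: {term: hypernyms}} from {text: {term: score}}.

    lookup is a hypothetical callable mapping a word to a list of hypernyms.
    """
    return {
        text_id: {term: lookup(term) for term in terms}
        for text_id, terms in top_terms.items()
    }


# Toy hypernym table standing in for a real lexical database.
TOY_HYPERNYMS = {
    "whale": ["mammal", "animal"],
    "ship": ["vessel", "craft"],
}

top_terms = {"text1": {"whale": 0.9, "ship": 0.4}}
hypernyms = hypernyms_sketch(top_terms, lambda w: TOY_HYPERNYMS.get(w, []))
```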

get_top_terms(n_terms)
Returns a dictionary with the highest n_terms for each bacalhau.text.Text from the term data dictionary.
Parameters: n_terms (int) – maximum number of terms to be used from each text.
Returns: dict
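Selecting the highest-scoring terms for each text can be sketched as below. The input layout `{term: {text_id: score}}` and the name `top_terms_sketch` are assumptions for illustration; the actual term data dictionary may be organised differently.

```python
import heapq


def top_terms_sketch(term_data, n_terms):
    """Return {text_id: {term: score}} with at most n_terms terms per text.

    term_data: {term: {text_id: score}}  (assumed layout)
    """
    per_text = {}
    # Regroup the scores by text so each text can be ranked on its own.
    for term, texts in term_data.items():
        for text_id, score in texts.items():
            per_text.setdefault(text_id, {})[term] = score

    top = {}
    for text_id, scores in per_text.items():
        best = heapq.nlargest(n_terms, scores.items(), key=lambda item: item[1])
        top[text_id] = dict(best)
    return top


scores = {
    "whale": {"text1": 0.9, "text2": 0.1},
    "ship": {"text1": 0.4, "text2": 0.7},
    "sea": {"text1": 0.2},
}
top = top_terms_sketch(scores, n_terms=2)
```

A text with fewer than `n_terms` terms simply keeps all of them, which matches the "maximum number of terms" wording of the parameter.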

get_topic_tree(hypernyms)
Generates and returns a bacalhau.topic_tree.TopicTree for the given hypernyms.
Parameters: hypernyms (dict) – dictionary of hypernyms.
Return type: bacalhau.topic_tree.TopicTree
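One way to picture building a tree from a hypernym dictionary is to merge each term's hypernym path into a shared hierarchy. The sketch below uses a nested dict rather than bacalhau.topic_tree.TopicTree, and it assumes each hypernym list runs from the nearest hypernym to the most general one; both the representation and the ordering are assumptions, not the library's documented behaviour.

```python
def topic_tree_sketch(hypernyms):
    """Merge hypernym paths into a nested-dict tree (illustrative only).

    hypernyms: {text_id: {term: [nearest hypernym, ..., most general]}}
    """
    tree = {}
    for terms in hypernyms.values():
        for term, path in terms.items():
            node = tree
            # Walk from the most general hypernym down to the term itself.
            for label in list(reversed(path)) + [term]:
                node = node.setdefault(label, {})
    return tree


hypernyms = {
    "text1": {"whale": ["mammal", "animal"]},
    "text2": {"dog": ["mammal", "animal"], "oak": ["tree", "plant"]},
}
tree = topic_tree_sketch(hypernyms)
```

Terms from different texts that share hypernyms ("whale" and "dog" here) end up under the same branch, which is what lets a topic hierarchy emerge across the corpus.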