bacalhau.corpus.Corpus
-
class bacalhau.corpus.Corpus(corpus_path, document_class, tokenizer=<WordPunctTokenizer object>, stopwords=<Mock object>, **document_kwargs)
Bases: object
A manager class to generate topic hierarchies from files.
Creates a new Corpus for the given path, using the given bacalhau.document.Document class to process the files.
Parameters:
- corpus_path (str) – path to the files.
- document_class (bacalhau.document.Document) – document class used to process the corpus files.
- tokenizer (nltk.tokenize.api.TokenizerI) – tokenizer used to tokenize the files in the corpus; defaults to nltk.tokenize.regexp.WordPunctTokenizer.
- stopwords (list) – words to be removed from the texts; defaults to nltk.corpus.stopwords.words('english').
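To illustrate what the tokenizer and stopwords parameters contribute, here is a minimal standalone sketch. It uses the regular expression that nltk's WordPunctTokenizer is built on, so it runs without nltk. The stopword set, the function name tokenize_and_filter, and the case-insensitive comparison are assumptions made for illustration; how Corpus applies these two arguments internally is not stated here.

```python
import re

# Same pattern that nltk's WordPunctTokenizer is built on: runs of word
# characters, or runs of non-space punctuation.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

# Hypothetical, tiny stopword set for illustration only; the documented
# default is nltk.corpus.stopwords.words('english').
STOPWORDS = {"the", "of", "a", "and", "is"}


def tokenize_and_filter(text, stopwords=STOPWORDS):
    """Split text into tokens, then drop stopwords (case-insensitively)."""
    tokens = WORD_PUNCT.findall(text)
    return [t for t in tokens if t.lower() not in stopwords]


print(tokenize_and_filter("The history of Lisbon is long."))
# prints ['history', 'Lisbon', 'long', '.']
```

Note that punctuation survives as its own token; a stopword list that should also drop punctuation would need to include it explicitly.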
-
_add_tf_idf(term_data)
Returns term_data with a TF.IDF value added to each term/text combination.
Parameters: term_data (dict) – dict with term/text combination.
Return type: dict
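The TF.IDF step can be sketched as follows. This is not the library's implementation: the shape of term_data ({term: {text_id: {"count": int}}}), the "tf.idf" key, and the weighting (raw count times the natural log of the inverse document frequency) are all assumptions chosen to show one common variant.

```python
import math


def add_tf_idf(term_data, text_count):
    """Return term_data with a 'tf.idf' value added per term/text pair.

    Assumed shape (hypothetical): {term: {text_id: {"count": int}}}.
    """
    for term, texts in term_data.items():
        # Inverse document frequency: terms found in fewer texts weigh more.
        idf = math.log(text_count / len(texts))
        for text_id, data in texts.items():
            data["tf.idf"] = data["count"] * idf
    return term_data


term_data = {
    "fish": {"a.txt": {"count": 3}, "b.txt": {"count": 1}},
    "salt": {"a.txt": {"count": 2}},
}
result = add_tf_idf(term_data, text_count=2.0)
print(result["salt"]["a.txt"]["tf.idf"])
```

In this variant a term that occurs in every text ("fish" above) gets a weight of zero, which is why such terms tend to drop out of the top-term lists.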
-
_get_documents()
Creates a bacalhau.document.Document object for each of the files in the corpus, and returns them in a list.
Parameters: corpus_path (str) – path to the corpus files.
Returns: documents in this corpus.
Return type: list
-
_get_hypernym(word)
Returns a list of the hypernyms for the given word.
Parameters: word (str) – the word to get the hypernym for.
Return type: list
-
_get_term_data()
Returns term data for all of the bacalhau.document.Document objects in this corpus.
Return type: dict
-
_get_text_count()
Returns the number of bacalhau.text.Text objects in this corpus.
Return type: float
-
annotate_topic_tree(tree)
Annotates the nodes in the bacalhau.topic_tree.TopicTree with information about which bacalhau.text.Text objects each node relates to, together with the corresponding counts.
Parameters: tree (bacalhau.topic_tree.TopicTree) – topic tree of terms.
Return type: bacalhau.topic_tree.TopicTree
-
generate_topic_tree(n_terms)
Generates a bacalhau.topic_tree.TopicTree for the corpus, using a maximum of n_terms from each bacalhau.text.Text. First extracts top terms; second gets hypernyms for each of the terms; third creates the bacalhau.topic_tree.TopicTree using the hypernyms.
Parameters: n_terms (int) – maximum number of terms to be used from each Text.
Returns: the generated topic tree.
Return type: bacalhau.topic_tree.TopicTree
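The three steps described above map onto the public methods documented on this page, and can be sketched as a simple chain. The function name generate_topic_tree_sketch and the FakeCorpus stand-in are hypothetical and exist only so the example runs without bacalhau installed; whether the real method performs extra work between steps is not stated here.

```python
def generate_topic_tree_sketch(corpus, n_terms):
    """Chain the three documented steps on a Corpus-like object."""
    top_terms = corpus.get_top_terms(n_terms)    # 1. top terms per text
    hypernyms = corpus.get_hypernyms(top_terms)  # 2. hypernyms per term
    return corpus.get_topic_tree(hypernyms)      # 3. tree from hypernyms


class FakeCorpus:
    """Hypothetical stand-in that records call order; not part of bacalhau."""

    def __init__(self):
        self.calls = []

    def get_top_terms(self, n_terms):
        self.calls.append("get_top_terms")
        return {"a.txt": {"cod": 1.0}}

    def get_hypernyms(self, top_terms):
        self.calls.append("get_hypernyms")
        return {"a.txt": {"cod": ["fish"]}}

    def get_topic_tree(self, hypernyms):
        self.calls.append("get_topic_tree")
        return hypernyms  # a real Corpus would return a TopicTree here


corpus = FakeCorpus()
tree = generate_topic_tree_sketch(corpus, n_terms=10)
print(corpus.calls)
```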
-
get_hypernyms(top_terms)
Returns a dictionary with the hypernyms for the given terms.
Parameters: top_terms (dict) – dict with term/text information.
Returns: {text: {term: hypernym}}.
Return type: dict
-
get_top_terms(n_terms)
Returns a dictionary with the highest n_terms for each bacalhau.text.Text from the term data dictionary.
Parameters: n_terms (int) – maximum number of terms to be used from each text.
Returns: dict
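Selecting the highest-scoring terms per text can be sketched with the standard library. The input shape ({text_id: {term: score}}) and the name top_terms_per_text are assumptions for illustration; the actual layout of the term data dictionary is not described here.

```python
import heapq


def top_terms_per_text(scores, n_terms):
    """Keep at most n_terms highest-scoring terms for each text.

    Assumed shape (hypothetical): {text_id: {term: score}}.
    """
    return {
        text_id: dict(
            heapq.nlargest(n_terms, terms.items(), key=lambda kv: kv[1])
        )
        for text_id, terms in scores.items()
    }


scores = {
    "a.txt": {"cod": 4.2, "salt": 1.4, "boat": 0.3},
    "b.txt": {"port": 2.0},
}
top = top_terms_per_text(scores, n_terms=2)
print(top)
```

Because n_terms is a maximum, a text with fewer scored terms ("b.txt" above) simply keeps all of them.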
-
get_topic_tree(hypernyms)
Generates and returns a bacalhau.topic_tree.TopicTree for the given hypernyms.
Parameters: hypernyms (dict) – dictionary of hypernyms.
Return type: bacalhau.topic_tree.TopicTree
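One way to picture how hypernyms become a hierarchy is to merge each term's hypernym path into a shared nested structure. The sketch below uses plain nested dicts rather than bacalhau.topic_tree.TopicTree, and assumes each hypernym list is ordered from most general to most specific; both the data shape and the build_tree helper are hypothetical.

```python
def build_tree(hypernyms):
    """Merge hypernym paths into one nested dict, most general label first.

    Assumed shape (hypothetical): {text_id: {term: [hypernym, ...]}},
    with each list ordered from general to specific.
    """
    tree = {}
    for terms in hypernyms.values():
        for term, path in terms.items():
            node = tree
            # Walk (and create as needed) one level per hypernym,
            # then attach the term itself as a leaf.
            for label in list(path) + [term]:
                node = node.setdefault(label, {})
    return tree


hypernyms = {
    "a.txt": {"cod": ["animal", "fish"]},
    "b.txt": {"tuna": ["animal", "fish"], "oak": ["plant", "tree"]},
}
tree = build_tree(hypernyms)
print(tree)
```

Terms from different texts that share hypernyms ("cod" and "tuna" above) end up under the same branch, which is what makes the result usable as a topic hierarchy across the corpus.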