Reference#
Fuzz#
The fuzz module mimicks the fuzzywuzzy-like packages like
- fuzzywuzzy (seatgeek/fuzzywuzzy) 
- rapidfuzz (maxbachmann/rapidfuzz) 
The main difference is that the Levenshtein distance is replaced by the Joint Complexity distance. The API is also slightly change to enable new features:
- The list of possible choices can be pre-trained ( - fit()) to accelerate the computation in the case a stream of queries is sent against the same list of choices.
- Instead of one single query, a list of queries can be used. Computations will be parallelized. 
The main fuzz entry point is the Process class.
- class bof.fuzz.Process(n_range=5, preprocessor=None, length_impact=0.5, allow_updates=True)[source]#
- The process class is used to compute the closest choices from a list of queries base on joint complexity. - Parameters:
- n_range ( - intor None, optional) – Maximum factor size. If None, all factors will be extracted.:
- preprocessor (callable, optional) – Preprocessing function to apply to texts before adding them to the factor tree. 
- length_impact ( - float) – Importance of the length difference between two texts when computing the scores.
- allow_updates ( - bool) – When transforming queries, are new factors kept in the- CountVectorizer
- filename ( - str, optional) – If set, load from corresponding file.
- path ( - stror- Path, optional) – If set, specify the directory where the file is located.
 
 - vectorizer#
- The vectorizer used to compute factors. - Type:
 
 - choices_matrix#
- The factor matrix of the choices. - Type:
 
 - dedupe(contains_dup, threshold=60.0)[source]#
- Inspired by fuzzywuzzy’s dedupe function to remove (near) duplicates. Currently barely optimized (and probably bugged). - Parameters:
- Return type:
 - Examples - >>> contains_dupes = ['Frodo Baggin', 'Frodo Baggins', 'F. Baggins', 'Samwise G.', 'Gandalf', 'Bilbo Baggins'] >>> p = Process() >>> p.dedupe(contains_dupes) ['Frodo Baggins', 'F. Baggins', 'Samwise G.', 'Gandalf', 'Bilbo Baggins'] - F. Baggins is kept because the length difference impacts the results. Let us ignore the length. - >>> p.length_impact = 0.0 >>> p.dedupe(contains_dupes) ['Frodo Baggins', 'Samwise G.', 'Gandalf', 'Bilbo Baggins'] 
 - extract(queries, choices=None, limit=5, score_cutoff=40.0)[source]#
- Find the best matches amongst a list of choices. - Parameters:
- queries ( - stror- listof- str) – Text (or list of texts) to match amongst the queries.
- choices ( - listof- str, optional) – Possible choices. If None, the previously used (or fitted) choices will be used.
- limit ( - intor None) – Maximal number of results to give. If None, all choices will be returned (sorted).
- score_cutoff ( - float, optional) – Minimal score that a result must achieve to be considered a match
 
- Returns:
- If queries is a single text, the list of tuples containing the best choices and their scores. If queries is a list of texts, the list of list of tuples containing the best choices and their scores. 
- Return type:
 - Examples - >>> p = Process() >>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"] >>> p.extract("new york jets", choices, limit=2) [('New York Jets', 100.0), ('New York Giants', 46.835443037974684)] >>> p.extract("new york jets", choices, limit=None) [('New York Jets', 100.0), ('New York Giants', 46.835443037974684), ('Atlanta Falcons', 0.0), ('Dallas Cowboys', 0.0)] >>> p.extract(["new york", "atlanta"], choices, limit=2, score_cutoff=0.0) [[('New York Jets', 56.60377358490566), ('New York Giants', 47.61904761904762)], [('Atlanta Falcons', 37.28813559322034), ('New York Giants', 7.594936708860759)]] 
 - extractOne(queries, choices=None, score_cutoff=40.0)[source]#
- Find the best match amongst a list of choices. - Parameters:
- Returns:
- If queries is a single text, a tuple containing the best choice and its score. If queries is a list of texts, the list of tuples containing the best choices and their scores. 
- Return type:
 - Examples - >>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"] >>> p = Process() >>> p.extractOne("Cowboys", choices) ('Dallas Cowboys', 42.857142857142854) >>> p.extractOne(["Cowboys", "falcon's"], choices) [('Dallas Cowboys', 42.857142857142854), None] >>> p.extractOne(["Cowboys", "falcon's"], choices, score_cutoff=30) [('Dallas Cowboys', 42.857142857142854), ('Atlanta Falcons', 30.88235294117647)] 
 - fit(choices)[source]#
- Compute the factors of a list of choices. - Examples - >>> p = Process() >>> p.fit(["riri", "fifi", "rififi"]) - The choices: - >>> p.choices ['riri', 'fifi', 'rififi'] - Numbe of unique factors for each choice: - >>> p.choices_factors.astype(int) array([ 7, 7, 14]) - The matrix that associates factors to choices: - >>> p.choices_matrix.toarray() array([[2, 0, 1], [2, 0, 1], [1, 0, 0], [1, 0, 0], [2, 2, 3], [1, 0, 0], [1, 0, 0], [0, 2, 2], [0, 2, 2], [0, 1, 1], [0, 1, 1], [0, 1, 2], [0, 1, 2], [0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 0, 1]], dtype=uint32) - The corresponding factors: - >>> p.vectorizer.features ['r', 'ri', 'rir', 'riri', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'fifi', 'if', 'ifi', 'rif', 'rifi', 'rifif', 'ifif', 'ififi'] 
 - reset()[source]#
- Clear choices from the object. - Examples - >>> p = Process() >>> p.fit(["riri", "fifi", "rififi"]) >>> p.choices ['riri', 'fifi', 'rififi'] >>> p.choices_factors.astype(int) array([ 7, 7, 14]) >>> p.reset() >>> p.choices is None True >>> p.choices_factors is None True 
 - transform(queries, threshold=0.0)[source]#
- Compute the joint complexities of queries against choices, - Parameters:
 - Examples - >>> p = Process() >>> p.fit(["riri", "fifi", "rififi"]) - Notice the number of factors: - >>> p.vectorizer.m 18 - >>> p.transform(["rir", "fido", "rafifi", "screugneuhneu"]) array([[71.42857143, 9.09090909, 18.75 ], [ 6.25 , 21.42857143, 14.28571429], [ 9.09090909, 41.17647059, 34.7826087 ], [ 1.92307692, 0. , 1.69491525]]) - The factors have been augmented with the ones from the queries: - >>> p.vectorizer.m 79 - This is could be a memory issue if you keep entering very different queries. To keep the factors clean after a transform, set allow_updates to False. - >>> p.allow_updates = False >>> p.fit(["riri", "fifi", "rififi"]) >>> p.transform(["rir", "fido", "rafifi", "screugneuhneu"]) array([[71.42857143, 9.09090909, 18.75 ], [ 6.25 , 21.42857143, 14.28571429], [ 9.09090909, 41.17647059, 34.7826087 ], [ 1.92307692, 0. , 1.69491525]]) >>> p.vectorizer.m 18 >>> p.vectorizer.features ['r', 'ri', 'rir', 'riri', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'fifi', 'if', 'ifi', 'rif', 'rifi', 'rifif', 'ifif', 'ififi'] 
 
- bof.fuzz.get_best_choice(choices, scores, score_cutoff)[source]#
- Given a list of choices with scores, extract the best choice. - Parameters:
- Returns:
- Tuple containing the choice and its score if the latter is above the cutoff, None otherwise. 
- Return type:
- tupleor None
 
- bof.fuzz.get_best_choices(choices, scores, limit)[source]#
- Given a list of choices with scores, extract the best choices. - Parameters:
- Returns:
- List of tuples containing the choices and their scores. 
- Return type:
 
- bof.fuzz.jit_common_factors(queries_length, xind, xptr, choices_length, yind, yptr, m)[source]#
- Jitted function to compute the common factors between a corpus of queries and a corpus of choices. - Parameters:
- queries_length ( - int) – Number of documents in the corpus of queries
- xind ( - ndarray) – Indices of the query factor matrix
- xptr ( - ndarray) – pointers of the query factor matrix
- choices_length ( - int) – Number of documents in the corpus of choices
- yind ( - ndarray) – Indices of the transposed choices factor matrix
- yptr ( - ndarray) – Pointers of the transposed choices factor matrix
- m ( - int) – Size of the factor space for choices
 
- Returns:
- A queries_length X choices_length matrix that contains the number of (unique) factors between choices and queries. 
- Return type:
 
- bof.fuzz.jit_jc(queries_factors, choices_factors, common_factors, length_impact, threshold=0.0)[source]#
- Jitted function to compute a joint complexity between a corpus of queries and a corpus of choices. - Parameters:
- queries_factors ( - ndarray) – Vector of the number of unique factors for each query.
- choices_factors ( - ndarray) – Vector of the number of unique factors for each choice.
- common_factors ( - ndarray) – Matrix of the number of common unique factors between queries and choices.
- length_impact ( - float) – Importance of the length difference between two texts when computing the scores.
- threshold ( - float) – Don’t compute JC is common factors is less than threshold X (# query factors)
 
- Returns:
- Joint Complexity matrix 
- Return type:
 
- bof.fuzz.jit_square_factors(xind, xptr, yind, yptr, n, length_impact)[source]#
- Jitted function to compute the joint complexity between texts of a corpus. - Parameters:
- xind ( - ndarray) – Indices of the factor matrix
- xptr ( - ndarray) – pointers of the factor matrix
- yind ( - ndarray) – Indices of the transposed factor matrix
- yptr ( - ndarray) – Pointers of the transposed factor matrix
- n ( - int) – Corpus size
- length_impact ( - float) – Importance of the length difference between two texts when computing the scores.
 
- Returns:
- A n X n matrix that contains joint complexity scores of the corpus. 
- Return type:
 
Feature Extraction#
The feature_extraction module mimicks the module https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text
with a focus on character-based extraction.
The main differences are:
- it is slightly faster; 
- the features can be incrementally updated; 
- it is possible to fit only a random sample of factors to reduce space and computation time. 
The main entry point for this module is the CountVectorizer class, which mimicks
its scikit-learn counterpart (also named CountVectorizer).
It is in fact very similar to sklearn’s CountVectorizer using char or
char_wb analyzer option from that module.
- class bof.feature_extraction.CountVectorizer(n_range=5, preprocessor=None)[source]#
- Counts the factors of a list of document. - Parameters:
- preprocessor (callable, optional) – Preprocessing function to apply to texts before adding them to the factor tree. 
- n_range ( - intor None, optional) – Maximum factor size. If None, all factors will be extracted.
- filename ( - str, optional) – If set, load from corresponding file.
- path ( - stror- Path, optional) – If set, specify the directory where the file is located.
 
 - Examples - Build a vectorizer limiting factor size to 3: - >>> vectorizer = CountVectorizer(n_range=3) - Build the factor matrix of a corpus of texts. - >>> corpus = ["riri", "fifi", "rififi"] >>> vectorizer.fit_transform(corpus=corpus).toarray() array([[2, 2, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 2, 0, 0, 2, 2, 1, 1, 1, 0], [1, 1, 0, 3, 0, 0, 2, 2, 1, 2, 2, 1]], dtype=uint32) - List the factors in the corpus: - >>> vectorizer.features ['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif'] - property features#
- Get the list of features (internally, features are stored as a Numba Typed - dictthat associates factors to indexes).
 - fit(corpus, reset=True)[source]#
- Build the features. Does not build the factor matrix. - Parameters:
- Return type:
- None 
 - Examples - We compute the factors of a corpus. - >>> vectorizer = CountVectorizer(n_range=3) >>> vectorizer.fit(["riri", "fifi", "rififi"]) - The fit method does not return anything, but the factors have been populated: - >>> vectorizer.features ['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif'] - We fit another corpus. - >>> vectorizer.fit(["riri", "fifi"]) - The factors have been implicitly reset (rif is gone in this toy example): - >>> vectorizer.features ['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi'] - We keep pre-existing factors by setting reset to False: - >>> vectorizer.fit(["rififi"], reset=False) - The list of features has been updated (with rif`): - >>> vectorizer.features ['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif'] 
 - fit_transform(corpus, reset=True)[source]#
- Build the features and return the factor matrix. - Parameters:
- Returns:
- A sparse matrix that indicates for each document of the corpus its factors and their multiplicity. 
- Return type:
 - Examples - Build a FactorTree from a corpus of three documents: - >>> vectorizer = CountVectorizer(n_range=3) >>> vectorizer.fit_transform(["riri", "fifi", "rififi"]).toarray() array([[2, 2, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 2, 0, 0, 2, 2, 1, 1, 1, 0], [1, 1, 0, 3, 0, 0, 2, 2, 1, 2, 2, 1]], dtype=uint32) - List of factors (of size at most 3): - >>> vectorizer.features ['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif'] - Build a FactorTree from a corpus of two documents. - >>> vectorizer.fit_transform(["fifi", "rififi"]).toarray() array([[2, 2, 1, 2, 1, 1, 0, 0, 0], [2, 2, 1, 3, 2, 2, 1, 1, 1]], dtype=uint32) - Notice the implicit reset, as only factors from “fifi” and “rififi” are present: - >>> vectorizer.features ['f', 'fi', 'fif', 'i', 'if', 'ifi', 'r', 'ri', 'rif'] - >>> vectorizer.m 9 - With reset set to False, we can add another list without discarding pre-existing factors. - >>> vectorizer.fit_transform(["riri"], reset=False).toarray() array([[0, 0, 0, 2, 0, 0, 2, 2, 0, 1, 1, 1]], dtype=uint32) - Notice the presence of empty columns, which corresponds to pre-existing factors that do not exist in “riri”. - The size and list of factors: - >>> vectorizer.m 12 - >>> vectorizer.features ['f', 'fi', 'fif', 'i', 'if', 'ifi', 'r', 'ri', 'rif', 'rir', 'ir', 'iri'] - Setting n_range to None will compute all factors. - >>> vectorizer.n_range = None >>> vectorizer.fit_transform(["riri", "fifi", "rififi"]).toarray() array([[2, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 2, 0, 0, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 1, 0, 0, 3, 0, 0, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1]], dtype=uint32) 
 - sampling_fit(corpus, reset=True, sampling_rate=0.5, seed=None)[source]#
- Build a partial factor tree where only a random subset of factors are selected. Note that there is no sampling_fit_transform method, as mutualizing the processes would introduce incoherences in the factor description: you have to do a sampling_fit followed by a transform. - Parameters:
- Return type:
- None 
 - Examples - We fit a corpus to a tree a normal way to see the complete list of factors of size at most 3.. - >>> vectorizer = CountVectorizer() >>> vectorizer.fit(["riri", "fifi", "rififi"]) >>> vectorizer.features ['r', 'ri', 'rir', 'riri', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'fifi', 'if', 'ifi', 'rif', 'rifi', 'rifif', 'ifif', 'ififi'] - Now we use a sampling fit instead. Only a subset of the factors are selected. - >>> vectorizer.sampling_fit(["riri", "fifi", "rififi"], seed=42) >>> vectorizer.features ['r', 'ri', 'rir', 'riri', 'f', 'fi', 'fif', 'fifi', 'i', 'if', 'ifi'] - We random fit another corpus. We reset the seed to reproduce the example above. - >>> vectorizer.sampling_fit(["riri", "fifi"], sampling_rate=.2) - The factors have been implicitly reset. - >>> vectorizer.features ['r', 'ri', 'rir', 'riri', 'i', 'ir', 'iri'] - We add another corpus to the fit by setting reset to False: - >>> vectorizer.sampling_fit(["rififi"], reset=False, sampling_rate=.2) - The list of features has been updated: - >>> vectorizer.features ['r', 'ri', 'rir', 'riri', 'i', 'ir', 'iri', 'f', 'fi'] 
 - transform(corpus)[source]#
- Build factor matrix from the factors already computed. New factors are discarded. - Parameters:
- Returns:
- The factor count of the input corpus NB: if reset is set to False, the factor count of the pre-existing corpus is not returned but is internally preserved. 
- Return type:
 - Examples - To start, we fit a corpus: - >>> vectorizer = CountVectorizer(n_range=3) >>> vectorizer.fit_transform(["riri", "fifi", "rififi"]).toarray() array([[2, 2, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 2, 0, 0, 2, 2, 1, 1, 1, 0], [1, 1, 0, 3, 0, 0, 2, 2, 1, 2, 2, 1]], dtype=uint32) - The factors are: - >>> vectorizer.features ['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif'] - We now apply a transform. - >>> vectorizer.transform(["fir", "rfi"]).toarray() array([[1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0], [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0]], dtype=uint32) - The features have not been updated. For example, the only factors reported for “rfi” are “r”, “i”, “f”, and “fi”. Factors that were not fit (e.g. rf) are discarded. 
 
- bof.feature_extraction.build_end(n_range=None)[source]#
- Return a function of a starting position s and a text length tl that tells the end of scanning text from s. It avoids to test the value of n_range all the time when doing factor extraction. - Parameters:
- n_range ( - intor None) – Maximal factor size. If 0 or None, all factors are considered.
- Return type:
- callable 
 - Examples - >>> end = build_end() >>> end(7, 15) 15 >>> end(13, 15) 15 >>> end = build_end(5) >>> end(7, 15) 12 >>> end(13, 15) 15 
Common#
The common module contains miscellaneous classes and functions.
- class bof.common.MixInIO[source]#
- Provide basic save/load capacities to other classes. - dump(filename: str, path='.', overwrite=False, compress=True, stemize=True)[source]#
- Save instance to file. - Parameters:
 - Examples - >>> import tempfile >>> v1 = ToyClass(42) >>> v2 = ToyClass() >>> v2.value 0 >>> with tempfile.TemporaryDirectory() as tmpdirname: ... v1.dump(filename='myfile', compress=True, path=tmpdirname) ... dir_content = [file.name for file in Path(tmpdirname).glob('*')] ... v2 = ToyClass.load(filename='myfile', path=Path(tmpdirname)) ... v1.dump(filename='myfile', compress=True, path=tmpdirname) # doctest.ELLIPSIS File ...myfile.pkl.gz already exists! Use overwrite option to overwrite. >>> dir_content ['myfile.pkl.gz'] >>> v2.value 42 - >>> with tempfile.TemporaryDirectory() as tmpdirname: ... v1.dump(filename='myfile', compress=False, path=tmpdirname) ... v1.dump(filename='myfile', compress=False, path=tmpdirname) # doctest.ELLIPSIS File ...myfile.pkl already exists! Use overwrite option to overwrite. - >>> v1.value = 51 >>> with tempfile.TemporaryDirectory() as tmpdirname: ... v1.dump(filename='myfile', path=tmpdirname, compress=False) ... v1.dump(filename='myfile', path=tmpdirname, overwrite=True, compress=False) ... v2 = ToyClass.load(filename='myfile', path=tmpdirname) ... dir_content = [file.name for file in Path(tmpdirname).glob('*')] >>> dir_content ['myfile.pkl'] >>> v2.value 51 - >>> with tempfile.TemporaryDirectory() as tmpdirname: ... v2 = ToyClass.load(filename='thisfilenamedoesnotexist') Traceback (most recent call last): ... FileNotFoundError: [Errno 2] No such file or directory: ...