Reference

Fuzz

The fuzz module mimics the API of fuzzy string matching packages like fuzzywuzzy.

The main difference is that the Levenshtein distance is replaced by the Joint Complexity distance. The API is also slightly changed to enable new features:

  • The list of possible choices can be pre-trained (fit()) to accelerate the computation when a stream of queries is sent against the same list of choices.

  • Instead of a single query, a list of queries can be used; computations are then parallelized.

The main fuzz entry point is the Process class.
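
A minimal sketch of both features, with an illustrative list of choices (outputs are omitted since the scores depend on the settings):

>>> from bof.fuzz import Process
>>> p = Process()
>>> p.fit(["Paris", "London", "Berlin", "Madrid"])  # pre-train the choices once
>>> results = p.extract(["pariss", "berlln"], limit=1, score_cutoff=0.0)  # batched queries

results is then a list with one list of (choice, score) tuples per query.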

class bof.fuzz.Process(n_range=5, preprocessor=None, length_impact=0.5, allow_updates=True, filename=None, path='.')[source]

The Process class is used to compute the closest choices for a list of queries, based on joint complexity.

Parameters
  • n_range (int or None, optional) – Maximum factor size. If None, all factors will be extracted.

  • preprocessor (callable, optional) – Preprocessing function to apply to texts before adding them to the factor tree.

  • length_impact (float) – Importance of the length difference between two texts when computing the scores.

  • allow_updates (bool) – When transforming queries, should new factors be kept in the CountVectorizer?

  • filename (str, optional) – If set, load from corresponding file.

  • path (str or Path, optional) – If set, specify the directory where the file is located.
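
For instance, a sketch of a non-default configuration (the values and the lambda preprocessor are purely illustrative):

>>> p = Process(n_range=4, length_impact=0.2,
...             preprocessor=lambda txt: txt.strip().lower())
>>> p.fit(["New York Jets", "Dallas Cowboys"])  # choices are preprocessed before factor extraction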

vectorizer

The vectorizer used to compute factors.

Type

CountVectorizer

choices

The possible choices.

Type

list of str

choices_matrix

The factor matrix of the choices.

Type

csr_matrix

choices_factors

Number of unique factors of each choice.

Type

ndarray

choices_length

Number of choices.

Type

int

dedupe(contains_dup, threshold=60.0)[source]

Inspired by fuzzywuzzy’s dedupe function to remove (near) duplicates. Currently barely optimized (and probably buggy).

Parameters
  • contains_dup (list of str) – A list of texts with possible nearly redundant entries.

  • threshold (float, optional) – Minimal score that a result must achieve to be considered redundant.

Return type

list of str

Examples

>>> contains_dupes = ['Frodo Baggin', 'Frodo Baggins', 'F. Baggins', 'Samwise G.', 'Gandalf', 'Bilbo Baggins']
>>> p = Process()
>>> p.dedupe(contains_dupes)
['Frodo Baggins', 'F. Baggins', 'Samwise G.', 'Gandalf', 'Bilbo Baggins']

F. Baggins is kept because the length difference impacts the results. Let us ignore the length.

>>> p.length_impact = 0.0
>>> p.dedupe(contains_dupes)
['Frodo Baggins', 'Samwise G.', 'Gandalf', 'Bilbo Baggins']
extract(queries, choices=None, limit=5, score_cutoff=40.0)[source]

Find the best matches amongst a list of choices.

Parameters
  • queries (str or list of str) – Text (or list of texts) to match against the choices.

  • choices (list of str, optional) – Possible choices. If None, the previously used (or fitted) choices will be used.

  • limit (int or None) – Maximal number of results to give. If None, all choices will be returned (sorted).

  • score_cutoff (float, optional) – Minimal score that a result must achieve to be considered a match.

Returns

If queries is a single text, the list of tuples containing the best choices and their scores. If queries is a list of texts, the list of lists of tuples containing the best choices and their scores.

Return type

list

Examples

>>> p = Process()
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> p.extract("new york jets", choices, limit=2)
[('New York Jets', 100.0), ('New York Giants', 46.835443037974684)]
>>> p.extract("new york jets", choices, limit=None) 
[('New York Jets', 100.0),
 ('New York Giants', 46.835443037974684),
 ('Atlanta Falcons', 0.0),
 ('Dallas Cowboys', 0.0)]
>>> p.extract(["new york", "atlanta"], choices, limit=2, score_cutoff=0.0) 
[[('New York Jets', 56.60377358490566), ('New York Giants', 47.61904761904762)],
 [('Atlanta Falcons', 37.28813559322034), ('New York Giants', 7.594936708860759)]]
extractOne(queries, choices=None, score_cutoff=40.0)[source]

Find the best match amongst a list of choices.

Parameters
  • queries (str or list of str) – Text (or list of texts) to match against the choices.

  • choices (list of str, optional) – Possible choices. If None, the previously used (or fitted) choices will be used.

  • score_cutoff (float, optional) – Minimal score that a result must achieve to be considered a match.

Returns

If queries is a single text, a tuple containing the best choice and its score. If queries is a list of texts, the list of tuples containing the best choices and their scores.

Return type

tuple or list

Examples

>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> p = Process()
>>> p.extractOne("Cowboys", choices)
('Dallas Cowboys', 42.857142857142854)
>>> p.extractOne(["Cowboys", "falcon's"], choices)
[('Dallas Cowboys', 42.857142857142854), None]
>>> p.extractOne(["Cowboys", "falcon's"], choices, score_cutoff=30)
[('Dallas Cowboys', 42.857142857142854), ('Atlanta Falcons', 30.88235294117647)]
fit(choices)[source]

Compute the factors of a list of choices.

Parameters
  • choices (list of str) – The possible choices.

Return type

None

Examples

>>> p = Process()
>>> p.fit(["riri", "fifi", "rififi"])

The choices:

>>> p.choices
['riri', 'fifi', 'rififi']

Number of unique factors for each choice:

>>> p.choices_factors
array([ 7,  7, 14], dtype=int32)

The matrix that associates factors to choices:

>>> p.choices_matrix.toarray() 
array([[2, 0, 1],
       [2, 0, 1],
       [1, 0, 0],
       [1, 0, 0],
       [2, 2, 3],
       [1, 0, 0],
       [1, 0, 0],
       [0, 2, 2],
       [0, 2, 2],
       [0, 1, 1],
       [0, 1, 1],
       [0, 1, 2],
       [0, 1, 2],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1]], dtype=uint32)

The corresponding factors:

>>> p.vectorizer.features
['r', 'ri', 'rir', 'riri', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'fifi', 'if', 'ifi', 'rif', 'rifi', 'rifif', 'ifif', 'ififi']
reset()[source]

Clear choices from the object.

Examples

>>> p = Process()
>>> p.fit(["riri", "fifi", "rififi"])
>>> p.choices
['riri', 'fifi', 'rififi']
>>> p.choices_factors
array([ 7,  7, 14], dtype=int32)
>>> p.reset()
>>> p.choices is None
True
>>> p.choices_factors is None
True
transform(queries, threshold=0.0)[source]

Compute the joint complexities of queries against choices.

Parameters
  • queries (list of str) – The list of queries.

  • threshold (float) – Relative threshold below which joint complexity is not actually computed.

Examples

>>> p = Process()
>>> p.fit(["riri", "fifi", "rififi"])

Notice the number of factors:

>>> p.vectorizer.m
18
>>> p.transform(["rir", "fido", "rafifi", "screugneuhneu"]) 
array([[71.42857143,  9.09090909, 18.75      ],
       [ 6.25      , 21.42857143, 14.28571429],
       [ 9.09090909, 41.17647059, 34.7826087 ],
       [ 1.92307692,  0.        ,  1.69491525]])

The factors have been augmented with the ones from the queries:

>>> p.vectorizer.m
79

This could be a memory issue if you keep entering very different queries. To keep the factors clean after a transform, set allow_updates to False.

>>> p.allow_updates = False
>>> p.fit(["riri", "fifi", "rififi"])
>>> p.transform(["rir", "fido", "rafifi", "screugneuhneu"]) 
array([[71.42857143,  9.09090909, 18.75      ],
       [ 6.25      , 21.42857143, 14.28571429],
       [ 9.09090909, 41.17647059, 34.7826087 ],
       [ 1.92307692,  0.        ,  1.69491525]])
>>> p.vectorizer.m
18
>>> p.vectorizer.features
['r', 'ri', 'rir', 'riri', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'fifi', 'if', 'ifi', 'rif', 'rifi', 'rifif', 'ifif', 'ififi']
bof.fuzz.get_best_choice(choices, scores, score_cutoff)[source]

Given a list of choices with scores, extract the best choice.

Parameters
  • choices (list of str) – List of choices.

  • scores (ndarray) – Scores of the choices (must have the same size as choices).

  • score_cutoff (float) – Minimal score that the result must achieve to be considered a match.

Returns

Tuple containing the choice and its score if the latter is above the cutoff, None otherwise.

Return type

tuple or None
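
For intuition, a minimal pure-Python sketch of this contract (not the library’s actual implementation; the cutoff is assumed inclusive):

>>> import numpy as np
>>> def get_best_choice_sketch(choices, scores, score_cutoff):
...     i = int(np.argmax(scores))  # position of the highest score
...     if scores[i] >= score_cutoff:
...         return choices[i], float(scores[i])
...     return None
>>> get_best_choice_sketch(["ham", "spam"], np.array([80.0, 20.0]), score_cutoff=40.0)
('ham', 80.0)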

bof.fuzz.get_best_choices(choices, scores, limit)[source]

Given a list of choices with scores, extract the best choices.

Parameters
  • choices (list of str) – List of choices.

  • scores (ndarray) – Scores of the choices (must have the same size as choices).

  • limit (int or None) – Maximal number of results to give. If None, all choices will be returned (sorted).

Returns

List of tuples containing the choices and their scores.

Return type

list
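
Again, a minimal sketch of the contract (the actual implementation may break ties differently):

>>> import numpy as np
>>> def get_best_choices_sketch(choices, scores, limit):
...     order = np.argsort(-scores)  # indices sorted by decreasing score
...     if limit is not None:
...         order = order[:limit]
...     return [(choices[i], float(scores[i])) for i in order]
>>> get_best_choices_sketch(["ham", "spam", "bacon"], np.array([10.0, 30.0, 20.0]), limit=2)
[('spam', 30.0), ('bacon', 20.0)]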

bof.fuzz.jit_common_factors(queries_length, xind, xptr, choices_length, yind, yptr, m)[source]

Jitted function to compute the common factors between a corpus of queries and a corpus of choices.

Parameters
  • queries_length (int) – Number of documents in the corpus of queries.

  • xind (ndarray) – Indices of the query factor matrix.

  • xptr (ndarray) – Pointers of the query factor matrix.

  • choices_length (int) – Number of documents in the corpus of choices.

  • yind (ndarray) – Indices of the transposed choices factor matrix.

  • yptr (ndarray) – Pointers of the transposed choices factor matrix.

  • m (int) – Size of the factor space for choices.

Returns

A queries_length X choices_length matrix that contains the number of common (unique) factors between choices and queries.

Return type

ndarray
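
Entry (i, j) of the result counts the distinct factors shared by query i and choice j. As a non-jitted sketch of the same quantity, assuming the index/pointer arrays describe scipy-style CSR matrices, one can binarize the factor matrices and take their product (the toy matrices below are illustrative):

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> Xq = csr_matrix(np.array([[2, 0, 1], [0, 1, 1]]))  # queries x factors (counts)
>>> Xc = csr_matrix(np.array([[1, 1, 0], [0, 2, 2]]))  # choices x factors (counts)
>>> ((Xq > 0).astype(int) @ (Xc > 0).astype(int).T).toarray()
array([[1, 1],
       [1, 2]])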

bof.fuzz.jit_jc(queries_factors, choices_factors, common_factors, length_impact, threshold=0.0)[source]

Jitted function to compute a joint complexity between a corpus of queries and a corpus of choices.

Parameters
  • queries_factors (ndarray) – Vector of the number of unique factors for each query.

  • choices_factors (ndarray) – Vector of the number of unique factors for each choice.

  • common_factors (ndarray) – Matrix of the number of common unique factors between queries and choices.

  • length_impact (float) – Importance of the length difference between two texts when computing the scores.

  • threshold (float) – The joint complexity is not computed if the number of common factors is less than threshold X (number of query factors).

Returns

Joint Complexity matrix

Return type

ndarray
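
Ignoring length_impact and threshold, the score behaves like a Jaccard index on factor sets: the number of shared factors over the size of their union, scaled to 100. A simplified sketch (not the exact library formula, which also accounts for length_impact):

>>> def jc_sketch(q_factors, c_factors, common):
...     # Jaccard-like ratio: shared factors over the union of factors.
...     return 100 * common / (q_factors + c_factors - common)
>>> jc_sketch(5, 7, 5)  # "rir" (5 factors) vs "riri" (7 factors), cf. the transform example
71.42857142857143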

bof.fuzz.jit_square_factors(xind, xptr, yind, yptr, n, length_impact)[source]

Jitted function to compute the joint complexity between texts of a corpus.

Parameters
  • xind (ndarray) – Indices of the factor matrix.

  • xptr (ndarray) – Pointers of the factor matrix.

  • yind (ndarray) – Indices of the transposed factor matrix.

  • yptr (ndarray) – Pointers of the transposed factor matrix.

  • n (int) – Corpus size.

  • length_impact (float) – Importance of the length difference between two texts when computing the scores.

Returns

An n X n matrix that contains the joint complexity scores of the corpus.

Return type

ndarray
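
This is the square counterpart of jit_jc: each text of the corpus is scored against every other, giving a symmetric matrix with 100 on the diagonal, since a text shares all of its factors with itself. With the simplified jc_sketch above:

>>> jc_sketch(7, 7, 7)  # "riri" against itself
100.0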

bof.fuzz.txt_max(txt_list)[source]

Auxiliary function that returns the index of the longest text from a list. In case of ties, alphabetic order is used.

Parameters

txt_list (list of str) – List of texts.

Returns

The index of the longest text.

Return type

int

Examples

>>> txt_max(["This is long!", "That is long!", "It's short"])
1

Feature Extraction

The feature_extraction module mimics the sklearn.feature_extraction.text module (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text) with a focus on character-based extraction.

The main differences are:

  • it is slightly faster;

  • the features can be incrementally updated;

  • it is possible to fit only a random sample of factors to reduce space and computation time.

The main entry point for this module is the CountVectorizer class, which mimics its scikit-learn counterpart (also named CountVectorizer). It is in fact very similar to sklearn’s CountVectorizer with the char or char_wb analyzer option.
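
As a point of comparison, assuming a recent scikit-learn, the same factors can be extracted on the sklearn side (cf. the bof examples below for n_range=3):

>>> from sklearn.feature_extraction.text import CountVectorizer as SkCountVectorizer
>>> sk = SkCountVectorizer(analyzer='char', ngram_range=(1, 3))
>>> features = sk.fit(["riri", "fifi", "rififi"]).get_feature_names_out()
>>> sorted(features) == sorted(['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif'])
True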

class bof.feature_extraction.CountVectorizer(n_range=5, preprocessor=None, filename=None, path='.')[source]

Counts the factors of a list of documents.

Parameters
  • preprocessor (callable, optional) – Preprocessing function to apply to texts before adding them to the factor tree.

  • n_range (int or None, optional) – Maximum factor size. If None, all factors will be extracted.

  • filename (str, optional) – If set, load from corresponding file.

  • path (str or Path, optional) – If set, specify the directory where the file is located.

features_

Dictionary that maps factors to their index in the list.

Type

dict of str -> int

Examples

Build a vectorizer limiting factor size to 3:

>>> vectorizer = CountVectorizer(n_range=3)

Build the factor matrix of a corpus of texts.

>>> corpus = ["riri", "fifi", "rififi"]
>>> vectorizer.fit_transform(corpus=corpus).toarray() 
array([[2, 2, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 2, 0, 0, 2, 2, 1, 1, 1, 0],
       [1, 1, 0, 3, 0, 0, 2, 2, 1, 2, 2, 1]], dtype=uint32)

List the factors in the corpus:

>>> vectorizer.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif']
property features

Get the list of features (internally, features are stored as a dict that associates factors to indexes).

Returns

features – List of factors.

Return type

list of str

fit(corpus, reset=True)[source]

Build the features. Does not build the factor matrix.

Parameters
  • corpus (list of str) – Texts to analyze.

  • reset (bool, optional) – If True (default), clears the current features. If False, features are updated instead.

Return type

None

Examples

We compute the factors of a corpus.

>>> vectorizer = CountVectorizer(n_range=3)
>>> vectorizer.fit(["riri", "fifi", "rififi"])

The fit method does not return anything, but the factors have been populated:

>>> vectorizer.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif']

We fit another corpus.

>>> vectorizer.fit(["riri", "fifi"])

The factors have been implicitly reset (rif is gone in this toy example):

>>> vectorizer.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi']

We keep pre-existing factors by setting reset to False:

>>> vectorizer.fit(["rififi"], reset=False)

The list of features has been updated (with rif):

>>> vectorizer.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif']
fit_transform(corpus, reset=True)[source]

Build the features and return the factor matrix.

Parameters
  • corpus (list of str) – Texts to analyze.

  • reset (bool, optional) – Clears factors. If False, factors are updated instead.

Returns

A sparse matrix that indicates for each document of the corpus its factors and their multiplicity.

Return type

csr_matrix

Examples

Build the factor matrix of a corpus of three documents:

>>> vectorizer = CountVectorizer(n_range=3)
>>> vectorizer.fit_transform(["riri", "fifi", "rififi"]).toarray() 
array([[2, 2, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 2, 0, 0, 2, 2, 1, 1, 1, 0],
       [1, 1, 0, 3, 0, 0, 2, 2, 1, 2, 2, 1]], dtype=uint32)

List of factors (of size at most 3):

>>> vectorizer.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif']

Build the factor matrix of a corpus of two documents.

>>> vectorizer.fit_transform(["fifi", "rififi"]).toarray() 
array([[2, 2, 1, 2, 1, 1, 0, 0, 0],
       [2, 2, 1, 3, 2, 2, 1, 1, 1]], dtype=uint32)

Notice the implicit reset, as only factors from “fifi” and “rififi” are present:

>>> vectorizer.features
['f', 'fi', 'fif', 'i', 'if', 'ifi', 'r', 'ri', 'rif']
>>> vectorizer.m
9

With reset set to False, we can add another list without discarding pre-existing factors.

>>> vectorizer.fit_transform(["riri"], reset=False).toarray() 
array([[0, 0, 0, 2, 0, 0, 2, 2, 0, 1, 1, 1]], dtype=uint32)

Notice the presence of empty columns, which correspond to pre-existing factors that do not exist in “riri”.

The size and list of factors:

>>> vectorizer.m
12
>>> vectorizer.features
['f', 'fi', 'fif', 'i', 'if', 'ifi', 'r', 'ri', 'rif', 'rir', 'ir', 'iri']

Setting n_range to None will compute all factors.

>>> vectorizer.n_range = None
>>> vectorizer.fit_transform(["riri", "fifi", "rififi"]).toarray() 
array([[2, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 2, 0, 0, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 3, 0, 0, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1]],
       dtype=uint32)
property m

Get the number of features.

Returns

m – Number of factors.

Return type

int

no_none_range()[source]

Replace a None n_range by 0 before passing it to the Cython code.

Return type

None

sampling_fit(corpus, reset=True, sampling_rate=0.5, seed=42)[source]

Build a partial factor tree where only a random subset of factors is selected. Note that there is no sampling_fit_transform method, as combining the two operations would introduce inconsistencies in the factor description: you have to do a sampling_fit followed by a transform.

Parameters
  • corpus (list of str) – Texts to analyze.

  • reset (bool) – Clears current factors. If False, factors will be updated instead.

  • sampling_rate (float) – Probability of exploring the factors starting from a given position in the text.

  • seed (int) – Seed of the random generator.

Return type

None

Examples

We first fit a corpus the normal way to see the complete list of factors (of size at most 5, the default n_range).

>>> vectorizer = CountVectorizer()
>>> vectorizer.fit(["riri", "fifi", "rififi"])
>>> vectorizer.features
['r', 'ri', 'rir', 'riri', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'fifi', 'if', 'ifi', 'rif', 'rifi', 'rifif', 'ifif', 'ififi']

Now we use a sampling fit instead. Only a subset of the factors is selected.

>>> vectorizer.sampling_fit(["riri", "fifi", "rififi"])
>>> vectorizer.features
['i', 'ir', 'iri', 'f', 'fi', 'fif', 'fifi', 'if', 'ifi', 'r', 'ri', 'rif', 'rifi', 'rifif']

We sampling-fit another corpus. The seed is reset at each call, which reproduces the sampling above.

>>> vectorizer.sampling_fit(["riri", "fifi"])

The factors have been implicitly reset.

>>> vectorizer.features
['i', 'ir', 'iri', 'f', 'fi', 'fif', 'fifi', 'if', 'ifi']

We add another corpus to the fit by setting reset to False:

>>> vectorizer.sampling_fit(["rififi"], reset=False)

The list of features has been updated:

>>> vectorizer.features
['i', 'ir', 'iri', 'f', 'fi', 'fif', 'fifi', 'if', 'ifi', 'ifif', 'ififi']
transform(corpus)[source]

Build the factor matrix from the factors already computed. New factors are discarded.

Parameters

corpus (list of str) – Texts to analyze.

Returns

The factor count of the input corpus. NB: the factor count of any pre-existing corpus is not returned but is internally preserved.

Return type

csr_matrix

Examples

To start, we fit a corpus:

>>> vectorizer = CountVectorizer(n_range=3)
>>> vectorizer.fit_transform(["riri", "fifi", "rififi"]).toarray() 
array([[2, 2, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 2, 0, 0, 2, 2, 1, 1, 1, 0],
       [1, 1, 0, 3, 0, 0, 2, 2, 1, 2, 2, 1]], dtype=uint32)

The factors are:

>>> vectorizer.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif']

We now apply a transform.

>>> vectorizer.transform(["fir", "rfi"]).toarray() 
array([[1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0]], dtype=uint32)

The features have not been updated. For example, the only factors reported for “rfi” are “r”, “i”, “f”, and “fi”. Factors that were not fit (e.g. rf) are discarded.

bof.feature_extraction.build_end(n_range=None)[source]

Return a function of a starting position s and a text length l that gives the end position for scanning the text from s (min(s + n_range, l) if n_range is set, l otherwise). This avoids testing the value of n_range repeatedly during factor extraction.

Parameters

n_range (int or None) – Maximal factor size. If 0 or None, all factors are considered.

Return type

callable

Examples

>>> end = build_end()
>>> end(7, 15)
15
>>> end(13, 15)
15
>>> end = build_end(5)
>>> end(7, 15)
12
>>> end(13, 15)
15
bof.feature_extraction.number_of_factors(length, n_range=None)[source]

Return the number of factors (with multiplicity) of size at most n_range that exist in a text of length length. This allows pre-allocating working memory.

Parameters
  • length (int) – Length of the text.

  • n_range (int or None) – Maximal factor size. If None, all factors are considered.

Returns

The number of factors (with multiplicity).

Return type

int

Examples

>>> l = len("riri")
>>> number_of_factors(l)
10
>>> number_of_factors(l, n_range=2)
7
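
The count is simply the sum, over each factor size k from 1 up to the maximal size, of the length - k + 1 starting positions, as in this sketch:

>>> def number_of_factors_sketch(length, n_range=None):
...     n = length if n_range is None else min(n_range, length)
...     return sum(length - k + 1 for k in range(1, n + 1))
>>> number_of_factors_sketch(4), number_of_factors_sketch(4, n_range=2)
(10, 7)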

Common

The common module contains miscellaneous classes and functions.

class bof.common.MixInIO[source]

Provide basic save/load capabilities to other classes.

load(filename: str, path='.')[source]

Load instance from file.

Parameters
  • filename (str) – The stem of the filename.

  • path (str or Path, optional) – The location path.

save(filename: str, path='.', erase=False, compress=False)[source]

Save instance to file.

Parameters
  • filename (str) – The stem of the filename.

  • path (str or Path, optional) – The location path.

  • erase (bool) – Should the file be erased if it already exists?

  • compress (bool) – Should gzip compression be used?

Examples

>>> import tempfile
>>> from pathlib import Path
>>> from bof.feature_extraction import CountVectorizer
>>> vect1 = CountVectorizer(n_range=3)
>>> vect1.fit(["riri", "fifi", "rififi"])
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     vect1.save(filename='myfile', compress=True, path=tmpdirname)
...     vect2 = CountVectorizer(filename='myfile', path=Path(tmpdirname))
>>> vect2.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif']
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     vect1.save(filename='myfile', compress=True, path=tmpdirname)
...     dir_content = [f.name for f in Path(tmpdirname).glob('*')]
...     vect2 = CountVectorizer(filename='myfile', path=Path(tmpdirname))
...     vect1.save(filename='myfile', compress=True, path=tmpdirname) # doctest: +ELLIPSIS
File ...myfile.pkl.gz already exists! Use erase option to overwrite.
>>> dir_content
['myfile.pkl.gz']
>>> vect2.m
12
>>> from bof.fuzz import Process
>>> p1 = Process()
>>> p1.fit(["riri", "fifi", "rififi"])
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     p1.save(filename='myfile', compress=True, path=tmpdirname)
...     p2 = Process(filename='myfile', path=Path(tmpdirname))
>>> p2.extractOne("rififo")
('rififi', 63.1578947368421)
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     p1.save(filename='myfile', path=tmpdirname)
...     p1.save(filename='myfile', path=tmpdirname) # doctest: +ELLIPSIS
File ...myfile.pkl already exists! Use erase option to overwrite.
>>> vect1.fit_transform(["titi"], reset=False).toarray()
array([[0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 1]], dtype=uint32)
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     vect1.save(filename='myfile', path=tmpdirname)
...     vect1.save(filename='myfile', path=tmpdirname, erase=True)
...     vect2.load(filename='myfile', path=tmpdirname)
...     dir_content = [f.name for f in Path(tmpdirname).glob('*')]
>>> dir_content
['myfile.pkl']
>>> vect2.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif', 't', 'ti', 'tit', 'it', 'iti']
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...    vect2.load(filename='thisfilenamedoesnotexist', path=tmpdirname) # doctest: +ELLIPSIS
Traceback (most recent call last):
 ...
FileNotFoundError: [Errno 2] No such file or directory: ...
bof.common.default_preprocessor(txt)[source]

Default string preprocessor: trims extra spaces and lowercases the string txt.

Parameters

txt (str) – Text to process.

Returns

Processed text.

Return type

str

Examples

>>> default_preprocessor(" LaTeX RuleZ    ")
'latex rulez'
bof.common.make_random_bool_generator(probability_true=0.5)[source]

Provides a (possibly biased) random generator of booleans.

Parameters

probability_true (float, optional) – Probability of returning True.

Returns

random_boolean – A function that returns a random boolean when called.

Return type

callable

Examples

>>> rb = make_random_bool_generator()
>>> set_seed(seed=42)
>>> [rb() for _ in range(10)]
[True, False, False, False, True, True, True, False, False, False]
bof.common.set_seed(seed=42)[source]

Set the seed of the numba random generator (which is distinct from the numpy random generator).

Parameters

seed (int) – The seed number.

Return type

None