Landmarks
Introduced in v0.4, this high-level module allows deeper analysis of a small corpus by using individual query results for the embedding.
- class gismo.landmarks.Landmarks(source=None, to_text=None, **kwargs)
The Landmarks class is a subclass of Corpus. It offers the capability to batch-rank all its entries against a Gismo instance. After it has been processed, a Landmarks can be used to analyze/classify Gismo queries, Cluster, or Landmarks.
Landmarks also offers the possibility to reduce a source or a gismo to its neighborhood. This can be useful if the source is huge and one wants something smaller for performance.
- Parameters:
source (list) – The list of items that form Landmarks.
to_text (function) – The function that transforms an item into text.
kwargs (dict) – Custom default runtime parameters. You just need to specify the parameters that differ from DEFAULT_LANDMARKS_PARAMETERS.
Examples
Landmarks lean on a Gismo. We can use a toy Gismo to start with.
>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> gismo = Gismo(corpus, embedding)
>>> print(toy_source_text)
['Gizmo is a Mogwaï.', 'This is a sentence about Blade.', 'This is another sentence about Shadoks.', 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.', 'In chinese folklore, a Mogwaï is a demon.']
Landmarks are constructed exactly like a Corpus object, with a source and a to_text function.
>>> landmarks_source = [{'name': 'Movies', 'content': 'Star Wars, Gremlins, and Blade are movies.'},
...                     {'name': 'Gremlins', 'content': 'The Gremlins movie features a Mogwai.'},
...                     {'name': 'Star Wars', 'content': 'The Star Wars movies feature Yoda.'},
...                     {'name': 'Shadoks', 'content': 'Shadoks is a French sarcastic show.'},]
>>> landmarks = Landmarks(landmarks_source, to_text=lambda e: e['content'])
The fit() method computes gismo queries for all landmarks and retains the results.
>>> landmarks.fit(gismo)
We run the request Yoda and look at the key landmarks. Note that Gremlins comes before Star Wars. This is actually correct in this small dataset: the word Yoda only exists in one sentence, which contains the words Gremlins and Gizmo.
>>> success = gismo.rank('yoda')
>>> landmarks.get_landmarks_by_rank(gismo)
[{'name': 'Gremlins', 'content': 'The Gremlins movie features a Mogwai.'}, {'name': 'Star Wars', 'content': 'The Star Wars movies feature Yoda.'}, {'name': 'Movies', 'content': 'Star Wars, Gremlins, and Blade are movies.'}]
For better readability, we set the item post-processing (post_item) to return the name of a landmark item.
>>> landmarks.post_item = lambda lmk, i: lmk[i]['name']
>>> landmarks.get_landmarks_by_rank(gismo)
['Gremlins', 'Star Wars', 'Movies']
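The way a post_item callable is applied can be pictured with a small sketch that does not depend on gismo. It only assumes what the example above shows: the callable receives an indexable collection and an index. The ToyLandmarks class, its select method, and the raw-item default are hypothetical stand-ins, not the library's implementation.

```python
class ToyLandmarks:
    """Hypothetical stand-in for an indexable collection of landmark items."""

    def __init__(self, items):
        self.items = items
        # Assumed default: return the raw item unchanged.
        self.post_item = lambda lmk, i: lmk[i]

    def __getitem__(self, i):
        return self.items[i]

    def select(self, indices):
        # Apply the post-processing callable to each selected index.
        return [self.post_item(self, i) for i in indices]


toy = ToyLandmarks([{'name': 'Movies'}, {'name': 'Gremlins'}, {'name': 'Star Wars'}])
raw = toy.select([1, 2])                       # full items
toy.post_item = lambda lmk, i: lmk[i]['name']  # same callable as in the example above
names = toy.select([1, 2])                     # names only
print(raw)
print(names)
```

Swapping the callable changes only how results are presented; the selection of indices is untouched.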
The balance parameter sets the trade-off between the document space and the feature space. A balance set to 1.0 focuses only on documents.
>>> success = gismo.rank('blade')
>>> landmarks.get_landmarks_by_rank(gismo, balance=1)
['Movies']
A balance set to 0.0 focuses only on features. For blade, this triggers Shadoks as a secondary result, because of the shared word "sentence".
>>> landmarks.get_landmarks_by_rank(gismo, balance=0)
['Movies', 'Shadoks']
Landmarks can be used to analyze landmarks.
>>> landmarks.get_landmarks_by_rank(landmarks)
['Gremlins', 'Star Wars']
See again how balance can change things. Here a balance set to 0.0 (using only features) fully changes the results.
>>> landmarks.get_landmarks_by_rank(landmarks, balance=0)
['Shadoks']
As with Gismo, landmarks can provide clusters.
>>> success = gismo.rank('gizmo')
>>> landmarks.get_landmarks_by_cluster(gismo)
{'landmark': {'name': 'Gremlins', 'content': 'The Gremlins movie features a Mogwai.'}, 'focus': np.float64(0.999998...), 'children': [{'landmark': {'name': 'Gremlins', 'content': 'The Gremlins movie features a Mogwai.'}, 'focus': 1.0, 'children': []}, {'landmark': {'name': 'Star Wars', 'content': 'The Star Wars movies feature Yoda.'}, 'focus': 1.0, 'children': []}, {'landmark': {'name': 'Movies', 'content': 'Star Wars, Gremlins, and Blade are movies.'}, 'focus': 1.0, 'children': []}]}
We can set the post_cluster attribute to customize the output. Gismo provides a simple display.
>>> from gismo.post_processing import post_landmarks_cluster_print
>>> landmarks.post_cluster = post_landmarks_cluster_print
>>> landmarks.get_landmarks_by_cluster(gismo)
F: 1.00.
- Gremlins
- Star Wars
- Movies
As with Gismo, parameters like k, distortion, or resolution can be used.
>>> landmarks.get_landmarks_by_cluster(gismo, k=4, distortion=False, resolution=.9)
F: 0.03.
- F: 0.93.
-- F: 1.00.
--- Gremlins
--- Star Wars
-- Movies
- Shadoks
Note that a Cluster can also be used as reference for the get_landmarks_by_rank() and get_landmarks_by_cluster() methods.
>>> cluster = landmarks.get_landmarks_by_cluster(gismo, post=False)
>>> landmarks.get_landmarks_by_rank(cluster)
['Gremlins', 'Star Wars', 'Movies']
However, not every object can serve as a reference. For example, a plain string cannot be used.
>>> landmarks.get_landmarks_by_rank("Landmarks do not use external queries (pass them to a gismo") # doctest.ELLIPSIS
Traceback (most recent call last):
...
TypeError: bad operand type for unary -: 'NoneType'
Last but not least, landmarks can be used to reduce the size of a source or a Gismo. The reduction is controlled by the x_density attribute, which sets the number of documents each landmark is allowed to keep.
>>> landmarks.parameters.x_density = 1
>>> reduced_gismo = landmarks.get_reduced_gismo(gismo)
>>> reduced_gismo.corpus.source
['This is another sentence about Shadoks.', 'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.']
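The idea behind the reduction can be sketched without gismo: each landmark keeps its x_density best-scoring documents, and the reduced source is the union of what the landmarks keep. The reduce_source function and the scores below are hypothetical, and the library's own selection logic may differ in its details.

```python
def reduce_source(source, scores_per_landmark, x_density):
    """Keep, for each landmark, its x_density best-scoring documents.

    scores_per_landmark[l][d] is an assumed relevance of document d for
    landmark l. The original document order is preserved in the output.
    """
    kept = set()
    for scores in scores_per_landmark:
        best = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
        kept.update(best[:x_density])
    return [doc for d, doc in enumerate(source) if d in kept]


source = ["doc A", "doc B", "doc C", "doc D"]
scores_per_landmark = [
    [0.1, 0.7, 0.2, 0.0],  # landmark 0 prefers doc B
    [0.0, 0.6, 0.1, 0.3],  # landmark 1 also prefers doc B
    [0.2, 0.0, 0.0, 0.8],  # landmark 2 prefers doc D
]
reduced = reduce_source(source, scores_per_landmark, x_density=1)
print(reduced)
```

Because several landmarks may select the same document, the reduced source can contain fewer documents than the number of landmarks times x_density, as in the example above where four landmarks keep only two documents.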
Side remark #1: in the constructor, to_text indicates how to convert an item to str, while ranking_function (passed as rank in the example below) specifies how to run a query on a Gismo. It is nevertheless possible to have the text conversion handled by the ranking_function.
>>> landmarks = Landmarks(landmarks_source, rank=lambda g, q: g.rank(q['content']))
>>> landmarks.fit(gismo)
>>> success = gismo.rank('yoda')
>>> landmarks.post_item = lambda lmk, i: lmk[i]['name']
>>> landmarks.get_landmarks_by_rank(gismo)
['Star Wars', 'Movies', 'Gremlins']
However, this is bad practice. When you only need to customize the way an item is converted to text, you should stick to to_text. ranking_function is meant for more elaborate filters that require changing the default way gismo runs queries.
Side remark #2: if a landmark item query fails (its text does not intersect the gismo features), the default uniform projection will be used and a warning will be issued. This may lead to undesired results.
>>> landmarks_source.append({'name': 'unrelated', 'content': 'unrelated.'})
>>> landmarks = Landmarks(landmarks_source, to_text=lambda e: e['content'])
>>> landmarks.fit(gismo)
>>> success = gismo.rank('gizmo')
>>> landmarks.post_item = lambda lmk, i: lmk[i]['name']
>>> landmarks.get_landmarks_by_rank(gismo)
['Shadoks', 'unrelated']
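The fallback described in side remark #2 can be illustrated with a library-independent sketch. It assumes that a query fails when none of its words appears in any document, and that the fallback is a uniform vector over documents. The project function and its word-counting score are hypothetical and much simpler than what gismo actually computes.

```python
import warnings


def project(query, doc_words):
    """Toy projection of a query onto documents.

    Each document is scored by the number of query words it contains.
    If no query word is known, a uniform vector is returned and a warning issued.
    """
    vocabulary = set().union(*doc_words)
    words = {w for w in query.lower().split() if w in vocabulary}
    n = len(doc_words)
    if not words:
        warnings.warn(f"Query {query!r} matches no feature; using a uniform projection.")
        return [1.0 / n] * n
    hits = [len(words & doc) for doc in doc_words]
    total = sum(hits)  # at least 1, since every kept word occurs in some document
    return [h / total for h in hits]


doc_words = [{"gizmo", "mogwai"}, {"blade"}, {"shadoks"}, {"gizmo", "yoda"}]
matched = project("gizmo", doc_words)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fallback = project("unrelated", doc_words)
print(matched, fallback, len(caught))
```

A uniform vector expresses no preference for any document, so a landmark that falls back to it may surface in rankings where it has no real relevance.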
- fit(gismo, **kwargs)
Runs gismo queries on all landmarks. The relevance results are used to build two sets of vectors: x_vectors holds the vectors on the document space; y_vectors holds the vectors on the feature space. On each space, vectors are summed to build a direction, which is a sort of vector summary of the landmarks.
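The summation step can be pictured with a small sketch that does not use gismo. It assumes each landmark yields one relevance vector per space. The sum_vectors helper and the plain-list x_vectors and y_vectors below are illustrative only and do not reflect the library's internal data structures.

```python
def sum_vectors(vectors):
    """Sum equal-length vectors component-wise to obtain a single direction."""
    return [sum(components) for components in zip(*vectors)]


# Hypothetical relevance vectors for 3 landmarks
# over 4 documents (x) and 5 features (y).
x_vectors = [
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
    [0.25, 0.25, 0.25, 0.25],
]
y_vectors = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5],
]

x_direction = sum_vectors(x_vectors)  # summary on the document space
y_direction = sum_vectors(y_vectors)  # summary on the feature space
print(x_direction)
print(y_direction)
```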
- gismo.landmarks.get_direction(reference, balance)
Converts a reference object into an n+m direction (dense or sparse depending on reference type).
- Parameters:
reference (Gismo or Landmarks or Cluster or np.ndarray or csr_matrix) – The object from which a direction will be extracted.
balance (float in range [0.0, 1.0]) – The trade-off between documents and features. Set to 0.0, only the feature space will be used. Set to 1.0, only the document space will be used.
- Returns:
An n+m direction.
- Return type:
np.ndarray or csr_matrix
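One way to picture the role of balance is as a weight applied to each space before the two are concatenated into a single n+m vector. The sketch below assumes exactly that, using plain lists. The combine helper is hypothetical and the library's actual computation may differ.

```python
def combine(x_direction, y_direction, balance):
    """Concatenate a document-space vector (length n) and a feature-space
    vector (length m) into one n+m vector, weighted by balance."""
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must lie in [0.0, 1.0]")
    return ([balance * x for x in x_direction]
            + [(1.0 - balance) * y for y in y_direction])


x_direction = [0.5, 0.5]        # n = 2 documents
y_direction = [1.0, 0.0, 2.0]   # m = 3 features

documents_only = combine(x_direction, y_direction, balance=1.0)
features_only = combine(x_direction, y_direction, balance=0.0)
mixed = combine(x_direction, y_direction, balance=0.5)
print(documents_only)
print(features_only)
print(mixed)
```

With balance=1.0 the feature part is zeroed out and with balance=0.0 the document part is, which matches the two extremes described in the parameter documentation.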