Attributes and Methods in LexicalRichness
This addendum documents the underlying lexical-richness measures exposed as attributes and methods of the LexicalRichness class.
TTR: Type-Token Ratio (Chotlos 1944, Templin 1957)
- lexicalrichness.LexicalRichness.ttr()
Type-token ratio (TTR) computed as t/w, where t is the number of unique terms/vocab, and w is the total number of words. (Chotlos 1944, Templin 1957)
- Returns
Type-token ratio
- Return type
Float
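As an illustration (a minimal sketch of the formula, not the library's implementation), the ratio can be computed directly from a token list:

```python
def ttr(words):
    """Type-token ratio: unique terms t over total words w."""
    t = len(set(words))
    w = len(words)
    return t / w

tokens = "the cat sat on the mat".split()
print(ttr(tokens))  # 5 unique terms over 6 words
```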
RTTR: Root Type-Token Ratio (Guiraud 1954, 1960)
- lexicalrichness.LexicalRichness.rttr()
Root TTR (RTTR) computed as t/sqrt(w), where t is the number of unique terms/vocab, and w is the total number of words. Also known as Guiraud’s R and Guiraud’s index. (Guiraud 1954, 1960)
- Returns
Root type-token ratio
- Return type
Float
CTTR: Corrected Type-Token Ratio (Carroll 1964)
- lexicalrichness.LexicalRichness.cttr()
Corrected TTR (CTTR) computed as t/sqrt(2 * w), where t is the number of unique terms/vocab, and w is the total number of words. (Carroll 1964)
- Returns
Corrected type-token ratio
- Return type
Float
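Both the root and corrected variants are simple rescalings of the type-token counts; a sketch of the two formulas (illustrative, not the package source):

```python
from math import sqrt

def rttr(words):
    """Guiraud's R: t / sqrt(w)."""
    t, w = len(set(words)), len(words)
    return t / sqrt(w)

def cttr(words):
    """Carroll's corrected TTR: t / sqrt(2 * w)."""
    t, w = len(set(words)), len(words)
    return t / sqrt(2 * w)
```

Note that CTTR is simply RTTR divided by sqrt(2).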
Herdan: Herdan’s C (Herdan 1960, 1964)
- lexicalrichness.LexicalRichness.Herdan()
Computed as log(t)/log(w), where t is the number of unique terms/vocab, and w is the total number of words. Also known as Herdan’s C. (Herdan 1960, 1964)
- Returns
Herdan’s C
- Return type
Float
Summer: Summer (Summer 1966)
- lexicalrichness.LexicalRichness.Summer()
Computed as log(log(t)) / log(log(w)), where t is the number of unique terms/vocab, and w is the total number of words. (Summer 1966)
- Returns
Summer
- Return type
Float
Dugast: Dugast (Dugast 1978)
- lexicalrichness.LexicalRichness.Dugast()
Computed as (log(w) ** 2) / (log(w) - log(t)), where t is the number of unique terms/vocab, and w is the total number of words. (Dugast 1978)
- Returns
Dugast
- Return type
Float
Maas: Maas (Maas 1972)
- lexicalrichness.LexicalRichness.Maas()
Maas’s TTR, computed as (log(w) - log(t)) / (log(w) * log(w)), where t is the number of unique terms/vocab, and w is the total number of words. Unlike the other measures, a lower Maas score indicates higher lexical richness. (Maas 1972)
- Returns
Maas
- Return type
Float
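The four log-based measures above (Herdan, Summer, Dugast, Maas) share the same two inputs; a sketch of the formulas, assuming t and w are large enough that the logarithms are defined (Summer needs t > e, and Dugast is undefined when t == w):

```python
from math import log

def herdan_c(words):
    """Herdan's C: log(t) / log(w)."""
    t, w = len(set(words)), len(words)
    return log(t) / log(w)

def summer(words):
    """Summer: log(log(t)) / log(log(w))."""
    t, w = len(set(words)), len(words)
    return log(log(t)) / log(log(w))

def dugast(words):
    """Dugast: log(w)**2 / (log(w) - log(t))."""
    t, w = len(set(words)), len(words)
    return log(w) ** 2 / (log(w) - log(t))

def maas(words):
    """Maas: (log(w) - log(t)) / log(w)**2; lower means richer."""
    t, w = len(set(words)), len(words)
    return (log(w) - log(t)) / (log(w) ** 2)
```

Dugast is the reciprocal of Maas, which is why the two point in opposite directions.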
yulek: Yule’s K (Yule 1944, Tweedie and Baayen 1998)
yulei: Yule’s I (Yule 1944, Tweedie and Baayen 1998)
Herdan’s Vm (Herdan 1955, Tweedie and Baayen 1998)
Simpson’s D (Simpson 1949, Tweedie and Baayen 1998)
msttr: Mean Segmental Type-Token Ratio (Johnson 1944)
- lexicalrichness.LexicalRichness.msttr(self, segment_window=100, discard=True)
Mean segmental TTR (MSTTR), computed as the average of TTR scores over segments of a text.
Split the text into segments of length segment_window and compute the TTR of each segment. The MSTTR score is the sum of these scores divided by the number of segments. (Johnson 1944)
See also
segment_generator
Split a list into s segments of size r (segment_size).
- Parameters
segment_window (int) – Size of each segment (default=100).
discard (bool) – If True, discard the remaining segment (e.g. for a text size of 105 and a segment_window of 100, the last 5 tokens will be discarded). Default is True.
- Returns
Mean segmental type-token ratio (MSTTR)
- Return type
float
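A minimal sketch of the segmentation-and-average procedure described above (illustrative only; the library delegates segmentation to its segment_generator helper):

```python
def msttr(words, segment_window=100, discard=True):
    """Average TTR over non-overlapping segments of segment_window tokens."""
    segments = [words[i:i + segment_window]
                for i in range(0, len(words), segment_window)]
    if discard and segments and len(segments[-1]) < segment_window:
        segments = segments[:-1]  # drop the short trailing segment
    return sum(len(set(s)) / len(s) for s in segments) / len(segments)
```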
mattr: Moving Average Type-Token Ratio (Covington 2007, Covington and McFall 2010)
- lexicalrichness.LexicalRichness.mattr(self, window_size=100)
Moving average TTR (MATTR) computed using the average of TTRs over successive segments of a text.
Estimate TTR for tokens 1 to n, 2 to n+1, 3 to n+2, and so on until the end of the text (where n is window size), then take the average. (Covington 2007, Covington and McFall 2010)
See also
list_sliding_window
Returns a sliding window generator (of size window_size) over a sequence
- Parameters
window_size (int) – Size of each sliding window.
- Returns
Moving average type-token ratio (MATTR)
- Return type
float
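The moving-average idea can be sketched directly (a simplified stand-in for the library's sliding-window implementation):

```python
def mattr(words, window_size=100):
    """Average TTR over every window of window_size consecutive tokens."""
    n_windows = len(words) - window_size + 1
    ttrs = (len(set(words[i:i + window_size])) / window_size
            for i in range(n_windows))
    return sum(ttrs) / n_windows
```

For example, with tokens ['a', 'a', 'b'] and window_size=2, the windows are ('a', 'a') and ('a', 'b'), so MATTR is (0.5 + 1.0) / 2 = 0.75.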
mtld: Measure of Textual Lexical Diversity (McCarthy 2005, McCarthy and Jarvis 2010)
- lexicalrichness.LexicalRichness.mtld(self, threshold=0.72)
Measure of textual lexical diversity, computed as the mean length of sequential words in a text that maintains a minimum threshold TTR score.
Iterates over words until the TTR score falls below a threshold, then increases the factor counter by 1 and starts over. McCarthy and Jarvis (2010, p. 385) recommend a factor threshold in the range [0.660, 0.750]. (McCarthy 2005, McCarthy and Jarvis 2010)
- Parameters
threshold (float) – Factor threshold for MTLD. Algorithm skips to a new segment when TTR goes below the threshold (default=0.72).
- Returns
Measure of textual lexical diversity (MTLD)
- Return type
float
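McCarthy and Jarvis's full procedure averages a forward and a reversed pass and credits a partial factor for the leftover segment; a sketch under those assumptions (not the library's exact code):

```python
def mtld_pass(words, threshold=0.72):
    """One directional pass: mean segment length sustaining TTR >= threshold."""
    factors = 0.0
    types, count = set(), 0
    for word in words:
        count += 1
        types.add(word)
        if len(types) / count < threshold:
            factors += 1          # segment complete; start a new one
            types, count = set(), 0
    if count:                     # partial credit for the leftover segment
        leftover_ttr = len(types) / count
        factors += (1 - leftover_ttr) / (1 - threshold)
    return len(words) / factors if factors else float(len(words))

def mtld(words, threshold=0.72):
    """Average of a forward and a reversed MTLD pass."""
    return (mtld_pass(words, threshold) + mtld_pass(words[::-1], threshold)) / 2
```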
hdd: Hypergeometric Distribution Diversity (McCarthy and Jarvis 2007)
- lexicalrichness.LexicalRichness.hdd(self, draws=42)
Hypergeometric distribution diversity (HD-D) score.
For each term (t) in the text, compute the probability (p) of getting at least one appearance of t with a random draw of size n < N (text size). The contribution of t to the final HD-D score is p * (1/n). The final HD-D score thus sums over p * (1/n) with p computed for each term t. Described in McCarthy and Jarvis 2007, pp. 465-466. (McCarthy and Jarvis 2007)
- Parameters
draws (int) – Number of random draws in the hypergeometric distribution (default=42).
- Returns
Hypergeometric distribution diversity (HD-D) score
- Return type
float
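The "at least one appearance" probability is hypergeometric: with N total tokens, a term occurring f times is missed entirely by an n-token draw without replacement with probability C(N-f, n) / C(N, n). A sketch of the scoring described above:

```python
from math import comb
from collections import Counter

def hdd(words, draws=42):
    """HD-D: sum over terms of P(term appears in an n-token draw) * (1/n)."""
    n, N = draws, len(words)
    score = 0.0
    for f in Counter(words).values():
        p_missing = comb(N - f, n) / comb(N, n)  # hypergeometric miss probability
        score += (1 - p_missing) / n
    return score
```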
vocd: voc-D (McKee, Malvern, and Richards 2000)
- lexicalrichness.LexicalRichness.vocd(self, ntokens=50, within_sample=100, iterations=3, seed=42)
Vocd score of lexical diversity derived from a series of TTR samplings and curve fittings.
Vocd is meant as a measure of lexical diversity robust to varying text lengths. See also hdd. The vocd is computed in 4 steps as follows.
Step 1: Take 100 random samples of 35 words from the text. Compute the mean TTR from the 100 samples.
Step 2: Repeat this procedure for samples of 36 words, 37 words, and so on, up to ntokens (recommended as 50 [default]), computing the mean TTR at each sample size. This yields an array of averaged TTR values for ntoken=35, ntoken=36, and so on, up to ntoken=50.
Step 3: Find the best-fitting curve for the empirical relationship between TTR and sample size (ntokens). The value of D that provides the best fit is the voc-D score.
Step 4: Repeat steps 1 to 3 a number of times (iterations, default=3) and average D; this average is the returned value.
See also
ttr_nd
TTR as a function of latent lexical diversity (d) and text length (n).
- Parameters
ntokens (int) – Maximum number for the token/word size in the random samplings (default=50).
within_sample (int) – Number of samples for each token/word size (default=100).
iterations (int) – Number of times to repeat steps 1 to 3 before averaging (default=3).
seed (int) – Seed for the pseudo-random number generator in random.sample() (default=42).
- Returns
voc-D
- Return type
float
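The steps above can be sketched as follows. The model fitted is TTR(n, D) = (D/n)(sqrt(1 + 2n/D) - 1); the library's exact curve-fitting routine may differ, so this sketch simply grid-searches D for the smallest squared error:

```python
import random
from math import sqrt

def ttr_nd(n, d):
    """Model TTR as a function of sample size n and latent diversity d."""
    return (d / n) * (sqrt(1 + 2 * n / d) - 1)

def vocd(words, ntokens=50, within_sample=100, iterations=3, seed=42):
    rng = random.Random(seed)
    estimates = []
    for _ in range(iterations):
        # Steps 1-2: mean TTR for each sample size from 35 up to ntokens
        curve = []
        for n in range(35, ntokens + 1):
            ttrs = [len(set(rng.sample(words, n))) / n
                    for _ in range(within_sample)]
            curve.append((n, sum(ttrs) / within_sample))
        # Step 3: grid-search the D that best fits the empirical curve
        grid = [x / 10 for x in range(1, 2001)]  # candidate D values 0.1..200.0
        best_d = min(grid, key=lambda d: sum((ttr_nd(n, d) - mean) ** 2
                                             for n, mean in curve))
        estimates.append(best_d)
    # Step 4: average D over the iterations
    return sum(estimates) / iterations
```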
Helper: lexicalrichness.segment_generator
- lexicalrichness.segment_generator(List, segment_size)
Split a list into s segments of size r (segment_size).
- Parameters
List (list) – List of items to be segmented.
segment_size (int) – Size of each segment.
- Yields
List – s lists with r (segment_size) items in each list.
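A sketch of what this helper does (illustrative, not the package source):

```python
def segment_generator(List, segment_size):
    """Yield successive segments of up to segment_size items from List."""
    for i in range(0, len(List), segment_size):
        yield List[i:i + segment_size]
```

Note that the helper itself yields any short trailing segment; it is msttr's discard flag that decides whether to drop it.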
Helper: lexicalrichness.list_sliding_window
- lexicalrichness.list_sliding_window(sequence, window_size=2)
Returns a sliding window generator (of size window_size) over a sequence. Taken from https://docs.python.org/release/2.3.5/lib/itertools-example.html
Example:
List = ['a', 'b', 'c', 'd']
window_size = 2
- list_sliding_window(List, 2) ->
('a', 'b')
('b', 'c')
('c', 'd')
- Parameters
sequence (sequence (string, unicode, list, tuple, etc.)) – Sequence to be iterated over. window_size=1 is just a regular iterator.
window_size (int) – Size of each window.
- Yields
tuple – Tuples of window_size consecutive items from the sequence.
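The cited itertools example from the Python documentation amounts to:

```python
from itertools import islice

def list_sliding_window(sequence, window_size=2):
    """Yield tuples of window_size consecutive items from sequence."""
    it = iter(sequence)
    result = tuple(islice(it, window_size))  # prime the first window
    if len(result) == window_size:
        yield result
    for elem in it:
        result = result[1:] + (elem,)        # slide the window by one item
        yield result
```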
Helper: lexicalrichness.frequency_wordfrequency_table