Attributes and Methods in LexicalRichness

This addendum documents the lexicalrichness measures exposed as attributes and methods of the LexicalRichness class.

TTR: Type-Token Ratio (Chotlos 1944, Templin 1957)

lexicalrichness.LexicalRichness.ttr()

Type-token ratio (TTR) computed as t/w, where t is the number of unique terms/vocab, and w is the total number of words. (Chotlos 1944, Templin 1957)

Returns

Type-token ratio

Return type

Float
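
As a minimal sketch (not the library's own implementation), the TTR can be computed directly from a list of tokens:

```python
def ttr(words):
    """Type-token ratio: number of unique terms divided by total tokens."""
    return len(set(words)) / len(words)

tokens = "the cat sat on the mat".split()
print(ttr(tokens))  # 5 unique terms over 6 tokens ≈ 0.833
```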


RTTR: Root Type-Token Ratio (Guiraud 1954, 1960)

lexicalrichness.LexicalRichness.rttr()

Root TTR (RTTR) computed as t/sqrt(w), where t is the number of unique terms/vocab, and w is the total number of words. Also known as Guiraud’s R and Guiraud’s index. (Guiraud 1954, 1960)

Returns

Root type-token ratio

Return type

Float


CTTR: Corrected Type-Token Ratio (Carroll 1964)

lexicalrichness.LexicalRichness.cttr()

Corrected TTR (CTTR) computed as t/sqrt(2 * w), where t is the number of unique terms/vocab, and w is the total number of words. (Carroll 1964)

Returns

Corrected type-token ratio

Return type

Float


Herdan: Herdan’s C (Herdan 1960, 1964)

lexicalrichness.LexicalRichness.Herdan()

Computed as log(t)/log(w), where t is the number of unique terms/vocab, and w is the total number of words. Also known as Herdan’s C. (Herdan 1960, 1964)

Returns

Herdan’s C

Return type

Float


Summer: Summer (Summer 1966)

lexicalrichness.LexicalRichness.Summer()

Computed as log(log(t)) / log(log(w)), where t is the number of unique terms/vocab, and w is the total number of words. (Summer 1966)

Returns

Summer

Return type

Float


Dugast: Dugast (Dugast 1978)

lexicalrichness.LexicalRichness.Dugast()

Computed as (log(w) ** 2) / (log(w) - log(t)), where t is the number of unique terms/vocab, and w is the total number of words. (Dugast 1978)

Returns

Dugast

Return type

Float


Maas: Maas (Maas 1972)

lexicalrichness.LexicalRichness.Maas()

Maas’s TTR, computed as (log(w) - log(t)) / (log(w) * log(w)), where t is the number of unique terms/vocab, and w is the total number of words. Unlike the other measures, a lower Maas score indicates higher lexical richness. (Maas 1972)

Returns

Maas

Return type

Float
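
The measures above (RTTR through Maas) differ only in how t and w enter the formula. A minimal sketch, assuming plain math functions rather than the library's own code (the function name and dict keys here are illustrative):

```python
import math

def measures(words):
    """Compute the count-based richness measures from a token list."""
    t, w = len(set(words)), len(words)
    return {
        "rttr": t / math.sqrt(w),                                  # Guiraud's R
        "cttr": t / math.sqrt(2 * w),                              # corrected TTR
        "herdan": math.log(t) / math.log(w),                       # Herdan's C
        "summer": math.log(math.log(t)) / math.log(math.log(w)),   # Summer
        # note: Dugast is undefined when every token is unique (t == w)
        "dugast": math.log(w) ** 2 / (math.log(w) - math.log(t)),  # Dugast
        "maas": (math.log(w) - math.log(t)) / math.log(w) ** 2,    # Maas (lower = richer)
    }
```

Note that Dugast and Maas are reciprocals of one another, which is why a lower Maas score signals higher richness.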


yulek: Yule’s K (Yule 1944, Tweedie and Baayen 1998)


yulei: Yule’s I (Yule 1944, Tweedie and Baayen 1998)


herdan_vm: Herdan’s Vm (Herdan 1955, Tweedie and Baayen 1998)


simpson_d: Simpson’s D (Simpson 1949, Tweedie and Baayen 1998)


msttr: Mean Segmental Type-Token Ratio (Johnson 1944)

lexicalrichness.LexicalRichness.msttr(self, segment_window=100, discard=True)

Mean segmental TTR (MSTTR) computed as average of TTR scores for segments in a text.

Split a text into segments of length segment_window. For each segment, compute the TTR. The MSTTR score is the sum of these scores divided by the number of segments. (Johnson 1944)

See also

segment_generator

Split a list into s segments of size r (segment_size).

Parameters
  • segment_window (int) – Size of each segment (default=100).

  • discard (bool) – If True, discard the remaining segment (e.g. for a text size of 105 and a segment_window of 100, the last 5 tokens will be discarded). Default is True.

Returns

Mean segmental type-token ratio (MSTTR)

Return type

float
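
A minimal sketch of this procedure, assuming fixed, non-overlapping segments (not the library's own code; omits input validation for texts shorter than one segment):

```python
def msttr(words, segment_window=100, discard=True):
    """Mean segmental TTR: average of per-segment TTRs."""
    segments = [words[i:i + segment_window]
                for i in range(0, len(words), segment_window)]
    if discard and len(segments[-1]) < segment_window:
        segments = segments[:-1]      # drop the leftover partial segment
    ttrs = [len(set(seg)) / len(seg) for seg in segments]
    return sum(ttrs) / len(ttrs)
```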


mattr: Moving Average Type-Token Ratio (Covington 2007, Covington and McFall 2010)

lexicalrichness.LexicalRichness.mattr(self, window_size=100)

Moving average TTR (MATTR) computed using the average of TTRs over successive segments of a text.

Estimate TTR for tokens 1 to n, 2 to n+1, 3 to n+2, and so on until the end of the text (where n is window size), then take the average. (Covington 2007, Covington and McFall 2010)

See also

list_sliding_window

Returns a sliding window generator (of size window_size) over a sequence

Parameters

window_size (int) – Size of each sliding window.

Returns

Moving average type-token ratio (MATTR)

Return type

float
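
The moving-average procedure can be sketched as follows (a sketch, not the library's own code; assumes the text is at least window_size tokens long):

```python
def mattr(words, window_size=100):
    """Moving-average TTR: mean TTR over every window of window_size consecutive tokens."""
    ttrs = [len(set(words[i:i + window_size])) / window_size
            for i in range(len(words) - window_size + 1)]
    return sum(ttrs) / len(ttrs)
```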


mtld: Measure of Textual Lexical Diversity (McCarthy 2005, McCarthy and Jarvis 2010)

lexicalrichness.LexicalRichness.mtld(self, threshold=0.72)

Measure of textual lexical diversity, computed as the mean length of sequential words in a text that maintains a minimum threshold TTR score.

Iterates over words until the TTR score falls below a threshold, then increments the factor count by 1 and starts over. McCarthy and Jarvis (2010, p. 385) recommend a factor threshold in the range of [0.660, 0.750]. (McCarthy 2005, McCarthy and Jarvis 2010)

Parameters

threshold (float) – Factor threshold for MTLD. Algorithm skips to a new segment when TTR goes below the threshold (default=0.72).

Returns

Measure of textual lexical diversity (MTLD)

Return type

float
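
A hedged sketch of the algorithm, assuming the standard two-pass (forward and reversed) formulation with a partial factor for the leftover segment; the library's own code may differ in details such as tie-breaking at the threshold:

```python
def mtld_pass(words, threshold=0.72):
    """One directional pass: count 'factors', stretches whose TTR stays above threshold."""
    factors, types, count = 0.0, set(), 0
    for word in words:
        count += 1
        types.add(word)
        if len(types) / count <= threshold:
            factors += 1                  # stretch complete: TTR hit the threshold
            types, count = set(), 0
    if count:                             # partial credit for the leftover stretch
        factors += (1 - len(types) / count) / (1 - threshold)
    return len(words) / factors if factors else float("inf")

def mtld(words, threshold=0.72):
    # McCarthy and Jarvis average a forward pass and a backward pass
    return (mtld_pass(words, threshold) + mtld_pass(words[::-1], threshold)) / 2
```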


hdd: Hypergeometric Distribution Diversity (McCarthy and Jarvis 2007)

lexicalrichness.LexicalRichness.hdd(self, draws=42)

Hypergeometric distribution diversity (HD-D) score.

For each term (t) in the text, compute the probability (p) of getting at least one appearance of t with a random draw of size n < N (text size). The contribution of t to the final HD-D score is p * (1/n); the final HD-D score thus sums p * (1/n) over all terms t. Described in McCarthy and Jarvis 2007, pp. 465-466. (McCarthy and Jarvis 2007)

Parameters

draws (int) – Number of random draws in the hypergeometric distribution (default=42).

Returns

Hypergeometric distribution diversity (HD-D) score

Return type

float
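
The probability of at least one appearance follows from the hypergeometric distribution: for a term with frequency f in a text of N tokens, p = 1 - C(N - f, n) / C(N, n) for a draw of size n. A sketch using only the standard library (not the library's own code; assumes draws does not exceed the text length):

```python
from collections import Counter
from math import comb

def hdd(words, draws=42):
    """HD-D: sum over terms of P(term appears in a random sample of `draws` tokens) / draws."""
    n_tokens = len(words)
    score = 0.0
    for freq in Counter(words).values():
        # P(none of this term's tokens lands in the sample); math.comb
        # returns 0 when the term is too frequent to be missed entirely
        p_missing = comb(n_tokens - freq, draws) / comb(n_tokens, draws)
        score += (1 - p_missing) * (1 / draws)
    return score
```

When draws equals the text length every term is certain to appear, and HD-D reduces to the plain TTR.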


vocd: voc-D (McKee, Malvern, and Richards 2010)

lexicalrichness.LexicalRichness.vocd(self, ntokens=50, within_sample=100, iterations=3, seed=42)

Vocd score of lexical diversity derived from a series of TTR samplings and curve fittings.

Vocd is meant as a measure of lexical diversity robust to varying text lengths. See also hdd. The vocd is computed in 4 steps as follows.

Step 1: Take 100 random samples of 35 words from the text. Compute the mean TTR from the 100 samples.

Step 2: Repeat this procedure for samples of 36 words, 37 words, and so on, up to ntokens (recommended as 50 [default]). For each sample size, compute the mean TTR over the 100 samples. This yields an array of mean TTR values for ntokens = 35, 36, …, and so on until ntokens = 50.

Step 3: Find the best-fitting curve from the empirical function of TTR to word size (ntokens). The value of D that provides the best fit is the vocd score.

Step 4: Repeat steps 1 to 3 for x number (default=3) of times before averaging D, which is the returned value.

See also

ttr_nd

TTR as a function of latent lexical diversity (d) and text length (n).

Parameters
  • ntokens (int) – Maximum number for the token/word size in the random samplings (default=50).

  • within_sample (int) – Number of samples for each token/word size (default=100).

  • iterations (int) – Number of times to repeat steps 1 to 3 before averaging (default=3).

  • seed (int) – Seed for the pseudo-random number generator in random.sample() (default=42).

Returns

voc-D

Return type

float
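
The four steps can be sketched as follows. This is a rough illustration, not the library's implementation: a simple grid search over D (resolution 0.1, capped at 200) stands in for the curve-fitting routine, and the model function ttr_nd follows the formula TTR = (D/n) * (sqrt(1 + 2n/D) - 1):

```python
import random
from math import sqrt

def ttr_nd(d, n):
    """Model TTR as a function of latent lexical diversity d and sample size n."""
    return (d / n) * (sqrt(1 + 2 * n / d) - 1)

def vocd(words, ntokens=50, within_sample=100, iterations=3, seed=42):
    rng = random.Random(seed)
    d_estimates = []
    for _ in range(iterations):
        # Steps 1-2: mean TTR for each sample size from 35 up to ntokens
        sizes, mean_ttrs = [], []
        for n in range(35, ntokens + 1):
            ttrs = [len(set(rng.sample(words, n))) / n
                    for _ in range(within_sample)]
            sizes.append(n)
            mean_ttrs.append(sum(ttrs) / len(ttrs))
        # Step 3: grid search for the D that best fits the empirical TTR curve
        best_d = min((d / 10 for d in range(1, 2001)),
                     key=lambda d: sum((ttr_nd(d, n) - t) ** 2
                                       for n, t in zip(sizes, mean_ttrs)))
        d_estimates.append(best_d)
    # Step 4: average D over the iterations
    return sum(d_estimates) / len(d_estimates)
```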


Helper: lexicalrichness.segment_generator

lexicalrichness.segment_generator(List, segment_size)

Split a list into s segments of size r (segment_size).

Parameters
  • List (list) – List of items to be segmented.

  • segment_size (int) – Size of each segment.

Yields

List – s lists with r (segment_size) items in each list.
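
A minimal sketch of such a generator (not necessarily the library's exact code; the final segment may be shorter than segment_size):

```python
def segment_generator(items, segment_size):
    """Yield consecutive segments of segment_size items each."""
    for i in range(0, len(items), segment_size):
        yield items[i:i + segment_size]

print(list(segment_generator([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```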


Helper: lexicalrichness.list_sliding_window

lexicalrichness.list_sliding_window(sequence, window_size=2)

Returns a sliding window generator (of size window_size) over a sequence. Taken from https://docs.python.org/release/2.3.5/lib/itertools-example.html

Example:

List = ['a', 'b', 'c', 'd']

window_size = 2

list_sliding_window(List, 2) ->

('a', 'b')

('b', 'c')

('c', 'd')

Parameters
  • sequence (sequence (string, unicode, list, tuple, etc.)) – Sequence to be iterated over. window_size=1 is just a regular iterator.

  • window_size (int) – Size of each window.

Yields

List – Tuples of window_size consecutive items from the sequence.
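
The itertools recipe linked above can be sketched as follows (a sketch, not necessarily the library's exact code):

```python
from itertools import islice

def list_sliding_window(sequence, window_size=2):
    """Yield overlapping tuples of window_size consecutive items."""
    it = iter(sequence)
    window = tuple(islice(it, window_size))
    if len(window) == window_size:
        yield window
    for item in it:                       # slide: drop the oldest, append the newest
        window = window[1:] + (item,)
        yield window

print(list(list_sliding_window("abcd", 2)))  # [('a', 'b'), ('b', 'c'), ('c', 'd')]
```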


Helper: lexicalrichness.frequency_wordfrequency_table