Top 3 frequent words with case-insensitive counting

Problem. Return the top 3 most frequent words from a list.

Rules:

  • count words case-insensitively
  • return the answer in lowercase
  • if two words have the same frequency, the lexicographically smaller word comes first
  • if there are fewer than 3 distinct words, return all of them

Example:

words = ["Hello", "world", "hello", "World", "HELLO", "foo"]

Answer:

["hello", "world", "foo"]

This is a frequency table + ranking problem.

Solution. Normalize each word to lowercase, count frequencies, then sort by:

  • frequency descending
  • word ascending
from collections import Counter


def top_3_frequent_words(words):
	counts = Counter(word.lower() for word in words)
	ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
	return [word for word, _ in ranked[:3]]

Runtime is O(n + m log m) where m is the number of distinct words. Extra space is O(m).

Gotchas.

  • Apply lowercase before counting, not after.
  • The tie-break is lexicographic order on the normalized lowercase form.
  • For arbitrary top k, a heap can reduce the ranking phase to O(m log k).

Run: top 3 frequent words