A tool for text analysis


Keywords
social, science, lexicon, text, analysis
License
Other
Install
pip install empath==0.9

Documentation

Empath is a tool for analyzing text across lexical categories (similar to LIWC), and also generating new lexical categories to use for an analysis. See our paper.

You can install in python via pip:

pip install empath

Then in a python shell, import like this:

from empath import Empath
lexicon = Empath()

Analyze text over all pre-built categories:

lexicon.analyze("he hit the other person", normalize=True)
# => {'help': 0.0, 'office': 0.0, 'violence': 0.2, 'dance': 0.0, 'money': 0.0, 'wedding': 0.0, 'valuable': 0.0, 'domestic_work': 0.0, 'sleep': 0.0, 'medical_emergency': 0.0, 'cold': 0.0, 'hate': 0.0, 'cheerfulness': 0.0, 'aggression': 0.0, 'occupation': 0.0, 'envy': 0.0, 'anticipation': 0.0, 'family': 0.0, 'crime': 0.0, 'attractive': 0.0, 'masculine': 0.0, 'prison': 0.0, 'health': 0.0, 'pride': 0.0, 'dispute': 0.0, 'nervousness': 0.0, 'government': 0.0, 'weakness': 0.0, 'horror': 0.0, 'swearing_terms': 0.0, 'leisure': 0.0, 'suffering': 0.0, 'royalty': 0.0, 'wealthy': 0.0, 'white_collar_job': 0.0, 'tourism': 0.0, 'furniture': 0.0, 'school': 0.0, 'magic': 0.0, 'beach': 0.0, 'journalism': 0.0, 'morning': 0.0, 'banking': 0.0, 'social_media': 0.0, 'exercise': 0.0, 'night': 0.0, 'kill': 0.0, 'art': 0.0, 'play': 0.0, 'computer': 0.0, 'college': 0.0, 'traveling': 0.0, 'stealing': 0.0, 'real_estate': 0.0, 'home': 0.0, 'divine': 0.0, 'sexual': 0.0, 'fear': 0.0, 'monster': 0.0, 'irritability': 0.0, 'superhero': 0.0, 'business': 0.0, 'driving': 0.0, 'pet': 0.0, 'childish': 0.0, 'cooking': 0.0, 'exasperation': 0.0, 'religion': 0.0, 'hipster': 0.0, 'internet': 0.0, 'surprise': 0.0, 'reading': 0.0, 'worship': 0.0, 'leader': 0.0, 'independence': 0.0, 'movement': 0.2, 'body': 0.0, 'noise': 0.0, 'eating': 0.0, 'medieval': 0.0, 'zest': 0.0, 'confusion': 0.0, 'water': 0.0, 'sports': 0.0, 'death': 0.0, 'healing': 0.0, 'legend': 0.0, 'heroic': 0.0, 'celebration': 0.0, 'restaurant': 0.0, 'ridicule': 0.0, 'programming': 0.0, 'dominant_heirarchical': 0.0, 'military': 0.0, 'neglect': 0.0, 'swimming': 0.0, 'exotic': 0.0, 'love': 0.0, 'hiking': 0.0, 'communication': 0.0, 'hearing': 0.0, 'order': 0.0, 'sympathy': 0.0, 'hygiene': 0.0, 'weather': 0.0, 'anonymity': 0.0, 'trust': 0.0, 'ancient': 0.0, 'deception': 0.0, 'fabric': 0.0, 'air_travel': 0.0, 'fight': 0.0, 'dominant_personality': 0.0, 'music': 0.0, 'vehicle': 0.0, 'politeness': 0.0, 'toy': 0.0, 'farming': 0.0, 'meeting': 0.0, 'war': 0.0, 'speaking': 0.0, 'listen': 0.0, 'urban': 0.0, 'shopping': 0.0, 'disgust': 0.0, 'fire': 0.0, 'tool': 0.0, 'phone': 0.0, 'gain': 0.0, 'sound': 0.0, 'injury': 0.0, 'sailing': 0.0, 'rage': 0.0, 'science': 0.0, 'work': 0.0, 'appearance': 0.0, 'optimism': 0.0, 'warmth': 0.0, 'youth': 0.0, 'sadness': 0.0, 'fun': 0.0, 'emotional': 0.0, 'joy': 0.0, 'affection': 0.0, 'fashion': 0.0, 'lust': 0.0, 'shame': 0.0, 'torment': 0.0, 'economics': 0.0, 'anger': 0.0, 'politics': 0.0, 'ship': 0.0, 'clothing': 0.0, 'car': 0.0, 'strength': 0.0, 'technology': 0.0, 'breaking': 0.0, 'shape_and_size': 0.0, 'power': 0.0, 'vacation': 0.0, 'animal': 0.0, 'ugliness': 0.0, 'party': 0.0, 'terrorism': 0.0, 'smell': 0.0, 'blue_collar_job': 0.0, 'poor': 0.0, 'plant': 0.0, 'pain': 0.2, 'beauty': 0.0, 'timidity': 0.0, 'philosophy': 0.0, 'negotiate': 0.0, 'negative_emotion': 0.0, 'cleaning': 0.0, 'messaging': 0.0, 'competing': 0.0, 'law': 0.0, 'friends': 0.0, 'payment': 0.0, 'achievement': 0.0, 'alcohol': 0.0, 'disappointment': 0.0, 'liquid': 0.0, 'feminine': 0.0, 'weapon': 0.0, 'children': 0.0, 'ocean': 0.0, 'giving': 0.0, 'contentment': 0.0, 'writing': 0.0, 'rural': 0.0, 'positive_emotion': 0.0, 'musical': 0.0}

Or over a specific set of categories:

lexicon.analyze("he hit the other person", categories=["violence"])
# => {'violence': 1.0}

By default, Empath will return raw counts, but you can ask it to normalize over words in the document.

lexicon.analyze("he hit the other person", categories=["violence"], normalize=True)
# => {'violence': 0.2}

You can create new lexical categories for analysis using word embeddings in our VSM:

lexicon.create_category("colors",["red","blue","green"])
# => ["blue", "green", "purple", "purple", "green", "yellow", "red", "grey", "violet", "gray", "blue", "orange", "white", "pink", "yellow", "black", "brown", "brown", "red", "aqua", "turquoise", "blue_color", "colored", "color", "same_shade", "violet", "gray", "grey", "teal", "nice_shade", "coloured", "forest_green", "colored", "different_shade", "colour", "sparkly", "reddish", "beautiful_shade", "greenish", "indigo", "darker_shade", "emerald", "lovely_shade", "tints", "crimson", "dark_purple", "pink", "emerald", "sapphire", "golden", "lighter_shade", "lime_green", "coloured", "bright", "same_color", "specks", "red", "golden_color", "different_shades", "chocolate_brown", "orange", "bluish", "green", "deep_purple", "magenta", "green_color", "dark_shade", "bright_orange", "milky", "lilac", "light_brown", "sparkling", "golden_brown", "silvery", "baby_blue", "blood_red", "pink", "teal", "blue", "yellowish", "turquoise", "same_colour", "sparkly", "aquamarine", "black_color", "white", "cerulean", "perfect_shade", "dark", "speckled", "charcoal", "greyish", "midnight_blue", "emerald_green", "deep_brown", "ocean_blue", "flecks", "amber", "pinkish", "jet_black"]

Then analyze with those categories:

lexicon.analyze("My favorite color is blue", categories=["colors"], normalize=True)
# => {'colors': 0.4}

Right now Empath has three different models you can use to create categories: fiction, nytimes, and reddit. (I'm working on integrating all the different models soon). For now, they have different strengths and weaknesses in terms of generating categories. Nytimes would be better for something like the cold war:

lexicon.create_category("cold_war", ["cold_war"], model="nytimes")
# => ["cold_war", "the_cold_war", "the_Cold_War", "war", "Soviet_threat", "the_end_of_the_cold_war", "Communism", "world_war", "Soviet_empire", "Soviet_power", "Communism", "gulf_war", "Soviet_bloc", "the_Soviet_Union", "communism", "superpowers", "nuclear_age", "nuclear_war", "Soviet_system", "evil_empire", "Soviets", "wars", "arms_race", "Indochina", "detente", "Iran-Iraq_war", "Persian_Gulf_war", "American_power", "new_world_order", "American_involvement", "wartime", "American_foreign_policy", "American_occupation", "the_Soviet_Union's", "Soviet_Communism", "nuclear_arms_race", "the_Korean_War", "military_power", "Persian_Gulf_war", "great_powers", "Marshall_Plan", "the_Second_World_War", "Communist_rule", "the_Warsaw_Pact", "Soviet_military", "Reagan_years", "Reagan_era", "Cuban_missile_crisis", "world_wars", "postwar_period", "Communist_world", "military-industrial_complex", "perestroika", "superpower", "new_war", "Desert_Storm", "space_race", "Mikhail_Gorbachev", "Communist_system", "World_War_II", "nation-building", "the_Vietnam_War", "dictatorship", "South_Vietnam", "Iron_Curtain", "diplomacy", "old_Soviet_Union", "military_buildup", "containment", "German_unification", "Balkans", "gulf_crisis", "revolution", "last_war", "Soviet_era", "dictatorships", "warfare", "glasnost", "Soviet_state", "Communist_regimes", "domestic_politics", "Khrushchev", "American_diplomacy", "postwar_era", "Soviet_economy", "peacetime", "Korean_peninsula", "Allies", "Soviet-American_relations", "cold_war_era", "space_program", "Soviet_occupation", "arms_control", "Soviet_leaders", "World_War_I", "Western_alliance", "military_strategy", "quagmire", "regime", "fascism"]

You can adjust the size of the requested categories. You may not always get a bigger category when you ask for it because we're still filtering on a minimum cosine similarity.

lexicon.create_category("cold_war", ["cold_war"], model="nytimes", size=300)
# => ["cold_war", "the_cold_war", "the_Cold_War", "war", "Soviet_threat", "the_end_of_the_cold_war", "Communism", "world_war", "Soviet_empire", "Soviet_power", "Communism", "gulf_war", "Soviet_bloc", "the_Soviet_Union", "communism", "superpowers", "nuclear_age", "nuclear_war", "Soviet_system", "evil_empire", "Soviets", "wars", "arms_race", "Indochina", "detente", "Iran-Iraq_war", "Persian_Gulf_war", "American_power", "new_world_order", "American_involvement", "wartime", "American_foreign_policy", "American_occupation", "the_Soviet_Union's", "Soviet_Communism", "nuclear_arms_race", "the_Korean_War", "military_power", "Persian_Gulf_war", "great_powers", "Marshall_Plan", "the_Second_World_War", "Communist_rule", "the_Warsaw_Pact", "Soviet_military", "Reagan_years", "Reagan_era", "Cuban_missile_crisis", "world_wars", "postwar_period", "Communist_world", "military-industrial_complex", "perestroika", "superpower", "new_war", "Desert_Storm", "space_race", "Mikhail_Gorbachev", "Communist_system", "World_War_II", "nation-building", "the_Vietnam_War", "dictatorship", "South_Vietnam", "Iron_Curtain", "diplomacy", "old_Soviet_Union", "military_buildup", "containment", "German_unification", "Balkans", "gulf_crisis", "revolution", "last_war", "Soviet_era", "dictatorships", "warfare", "glasnost", "Soviet_state", "Communist_regimes", "domestic_politics", "Khrushchev", "American_diplomacy", "postwar_era", "Soviet_economy", "peacetime", "Korean_peninsula", "Allies", "Soviet-American_relations", "cold_war_era", "space_program", "Soviet_occupation", "arms_control", "Soviet_leaders", "World_War_I", "Western_alliance", "military_strategy", "quagmire", "regime", "fascism", "socialism", "Vietnam", "totalitarianism", "new_Europe", "American_leadership", "long_war", "World_War_II.", "colonial_rule", "the_Persian_Gulf_war", "atom_bomb", "NATO_alliance", "world_affairs", "military_threat", "home_front", "Western_Europe", "Eastern_Europe", "German_reunification", "glasnost", "Stalin", "Iraq_war", "Reagan_Presidency", "military_might", "American_policy", "colonialism", "major_war", "East-West_relations", "Soviet_history", "Soviet_rule", "Russians", "the_Gulf_War", "Atlantic_alliance", "the_Bay_of_Pigs", "democracies", "coups", "old_order", "Islamic_world", "Soviet_leadership", "unification", "Stalinism", "nuclear_threat", "Vietnam_era", "the_Afghan_war", "Gorbachev_era", "the_Vietnam_war", "American_President", "American_military_power", "Western_powers", "American_Government", "Soviet_domination", "foreign_policy", "military_establishment", "new_thinking", "Communist_regime", "Communist_era", "militarism", "isolationism", "the_Persian_Gulf", "first_gulf_war", "upheavals", "Saddam_Hussein's", "reunification", "Second_World_War", "Reagan_Administration", "Eastern_Europe's", "disintegration", "empires", "American_strategy", "civil_war", "Soviet_society", "Western_democracies", "common_enemy", "Communist_state", "Korean_Peninsula", "New_Deal", "the_Marshall_Plan", "Berlin_wall", "American_influence", "American_president", "Communist_dictatorship", "political_struggle", "the_Reagan_Administration", "American_public_opinion", "military_victory", "American_policy_makers", "Central_Europe", "modern_history"]