“Stable Genius” – Let’s Go to the Data

So, as always. First the headline, then you need to eat your vegetables to get the details.

The headline:

By any metric to measure vocabulary, using more than a half dozen tests with different methodologies, Donald Trump has the most basic, most simplistically constructed, least diverse vocabulary of any President in the last 90 years. This is by a statistically significant margin in each case.

Okay, the headline’s out of the way. On to the vegetables, so you understand why we checked this, and the methodology.

(And with our apologies for the simplistic charts. The Google Sheets plug-in is quick and dirty… but the data’s all there for you at the bottom.)

Why Are You Blogging on a Sunday Night?

Well, the Golden Globes are on. Also…

I usually try to unplug over the weekend. And by unplug, I mean “catch up on everything I was supposed to do during the week but didn’t because who the hell can get work done during office hours.” You know, by relaxing and stuff.

So the emails that started coming in Saturday morning around 8 a.m. kind of interfered with that plan. I ignored them for all of 20 seconds before seeing what the heck was going on. In general, when something is going on, the emails tend to clump together. The phone wasn’t going to stop vibrating by force of will alone.

Turned out, it was a number of folks asking if I’d seen the “genius” tweet, and if Factba.se had ever run an intelligence test.

Now, when someone emails me at an ungodly hour (and prior to 11 am on a weekend more than qualifies, given my normal bedtime is defined as “Thursday”) to ask about a tweet, I put the darker thoughts out of my mind and do my best not to get upset.

But I was awake. May as well spoil it. The tweet in question (a three-parter, which is more unusual of late since the character limit was upped):

…spanning 11 minutes. (Sorry about that last one… one of my favorite Road Runners).

The quote that seemed to stick out in everyone’s mind was the last one: “I think that would qualify as not smart, but genius….and a very stable genius at that!”

Okay, I was awake.

Apparently, the intellectual exercise would be to parse the word “genius” and ask whether it could be proven or disproven.

Into the Den of Snopes

Measuring intelligence is normally done through a simple method with no agreed-upon standard: an IQ test, a loosely defined standardized test, variations of which have been in use for more than a century. The most common one in modern use is the Wechsler Adult Intelligence Scale, fourth edition (WAIS-IV), in use since 2008.

However, there is no peer-reviewed method to look at writing / speeches / etc to assess intelligence. The closest is a 2006 study, which used a historiometric method.

Suffice it to say, that method is fine, but it takes a doctorate and an expert. We don’t have presidential scholars at Factba.se. We’re a bunch of data schmoos. Also, this particular study was ripped off and faked enough in the past 15 years that it has multiple Snopes pages (here, here, and here) and it rates its own Wikipedia page. Again, the study is fine. Making stuff up around it isn’t.


However, measuring the complexity of vocabulary, its diversity and its comprehension level is something we do all the time here in the Fact Cave, courtesy of Margaret, our platform’s AI. In fact, it’s done every time we add a word into the platform, automagically. The most common metric, the Flesch-Kincaid Grade Level, was actually developed for the military in the 1970s as a way to check that training materials were appropriate and could be understood by its personnel. It is also used as a measurement in legislation to ensure documents such as insurance policies can be understood.

There are a number of competing algorithms. They use different approaches, but all try to do one of two things:

  • Grade Level. Establish the grade level at which the text could be understood
  • Reading Ease. Essentially the same thing, but with a normalized statistical score vs. a U.S.-centric grade level.
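
For the curious, both Flesch-Kincaid scores boil down to two averages: words per sentence and syllables per word. Here is a minimal Python sketch using the standard published formulas. The function names are ours for illustration, and this is not necessarily how Margaret implements it. The inputs are Trump's averages from the data at the bottom of this post.

```python
def flesch_kincaid_grade(words_per_sentence: float, syllables_per_word: float) -> float:
    """Flesch-Kincaid Grade Level: higher means harder to read."""
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


def flesch_reading_ease(words_per_sentence: float, syllables_per_word: float) -> float:
    """Flesch Reading Ease: higher means easier to read."""
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Averages taken from the equal-sample table (11.59 words/sentence, 1.33 syllables/word).
grade = flesch_kincaid_grade(11.59, 1.33)
ease = flesch_reading_ease(11.59, 1.33)
print(f"grade level: {grade:.2f}, reading ease: {ease:.2f}")
```

Because the inputs are already rounded to two decimals, the results land close to the published figures rather than matching them digit for digit.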

At Factba.se, Margaret runs every single bit of text automatically through the following algorithms:

… and about a dozen others, including difficult word count, etc. We’re also testing the Lexile Framework.

As a side benefit, recreationally, we built a database of interviews, speeches and press conferences for previous presidents, leaning heavily on what’s available publicly from presidential libraries, and the wonderful collections at the University of California, Santa Barbara’s American Presidency Project. One of the reasons we did this is to provide a point of contrast. Looking at a single datapoint can tell you everything and nothing. A nice cohort comparison… that’s better.

Importantly, as we’ve blogged earlier, we like to focus on a person’s own words if possible, not speechwriters. The UCSB archive in particular gave us a rich trove of Presidential press conferences back to Herbert Hoover in 1929. So we could look at just what a president said. Unscripted (or as close an approximation as is possible for a president).

Okay. We had the algorithms. We had the text. On to…


As mentioned previously, we narrowed our samples from Hoover forward to just press conferences, presidential debates and interviews. Of course, within those, we only use words spoken by the President, nothing else.

This left us with a deep sample for each, but spread out. We ran the analysis two ways:

  • Complete. Whatever we have, we have. On the low end, it’s 44,705 words for Gerald Ford, up to 1,124,164 words for Bill Clinton. Trump clocked in second at 915,801 words.
  • Equal Sample. We then ran the same test on 30,000 words, plus or minus 1% (actual range was 30,003 – 30,253 words), where we looked only within the person’s presidency (no pre-election debates) and started from Inauguration Day forward, adding sentences until we hit 30,000, then stopped and analyzed those.
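
The equal-sample step is simple enough to sketch. This is a rough Python illustration of "add sentences until we hit the target, then stop", assuming the sentences are already in chronological order from Inauguration Day forward. The function name and the whitespace word count are our assumptions, not the production code.

```python
def equal_sample(sentences: list[str], target_words: int = 30_000) -> tuple[list[str], int]:
    """Take whole sentences in order until the running word count reaches the target."""
    sample: list[str] = []
    word_count = 0
    for sentence in sentences:
        if word_count >= target_words:
            break  # target reached: stop before adding another sentence
        sample.append(sentence)
        word_count += len(sentence.split())  # crude whitespace word count
    return sample, word_count


# Tiny demonstration with a target of 5 words instead of 30,000.
demo = ["one two three.", "four five.", "six seven eight nine.", "ten."]
picked, total = equal_sample(demo, target_words=5)
```

Since only whole sentences are added, the final count overshoots the target slightly, which is why each sample ends up a little above 30,000 words rather than exactly on it.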

In addition, we’ve been testing the Lexile framework. It’s a free test so we’re limited to 1,000 words. But we took the first 1,000 words (in full sentence format) from the equal sample and tested those.

It’s important to note: for the two presidents where social media existed, this was not included. This was strictly utilizing the responses given by a president in an interview, during a press conference, or in a political debate.

The Result

It statistically made no difference which way we analyzed it, or which method. It affected some scores and some of the ranks, but not the position of Donald Trump on that list. In each case, he ranked last of the past 15 presidents.

By every metric and methodology tested, Donald Trump’s vocabulary and grammatical structure are significantly simpler, and less diverse, than those of any President since Herbert Hoover, when measuring “off-script” words, that is, words far less likely to have been written in advance for the speaker.

Significant is not editorializing. The gap between Trump and the next closest president (in most indices, Harry Truman, known historically for a folksy, simple pattern of speech), is larger than any other gap using Flesch-Kincaid. Statistically speaking, there is a significant gap.

This gap appears both when using the complete corpus available to us for all presidents, and the more limited 30,000 word set to use an equal data set for each. In either data set, Donald Trump consistently clocks in at the bottom of the list. Depending on the scale used, it’s between a 3rd and 7th grade reading level.

Using Flesch-Kincaid, the same scale used by the Department of Defense, the grade level on the equal sample is 4.6. That’s between a fourth and fifth grade level.

The next closest is Truman at 5.9, followed by Bush 41 at 6.7. The top three: Herbert Hoover (11.3), Jimmy Carter (10.7) and Barack Obama (9.7).

In terms of word diversity and structure, Trump averages 1.33 syllables per word, while all others average 1.42 – 1.57 syllables. In terms of variety of vocabulary, in the 30,000-word sample, Trump was at the bottom, with 2,605 unique words, while all others ranged from 3,068 to 3,869. The exception: Bill Clinton, who clocked in at 2,752 unique words in our sample.
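
To make those two measures concrete, here is a small Python sketch of counting unique words and estimating syllables per word. The tokenizer and the vowel-group syllable heuristic are deliberately crude assumptions on our part; real readability tools use better rules, so treat this as an illustration and not the method behind the numbers above.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z'’]+", text.lower())


def unique_word_count(text: str) -> int:
    """Number of distinct words in the text."""
    return len(set(tokenize(text)))


def estimate_syllables(word: str) -> int:
    """Crude heuristic: each run of vowels counts as one syllable (minimum 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def average_syllables_per_word(text: str) -> float:
    words = tokenize(text)
    return sum(estimate_syllables(w) for w in words) / len(words)


sample_text = "We will win. We will win big."
distinct = unique_word_count(sample_text)
syllables = average_syllables_per_word(sample_text)
```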

So What?

That’s a fair question. So what? Vocabulary is not a proxy for intelligence. In IQ tests, vocabulary is a component, but only a component. However, it is used as a proxy for a number of things:

  • Doctors use it to measure symptoms of degenerative brain diseases (note: as blogged previously, we see no downward trend over 40 years in Trump’s vocabulary. For unscripted, it’s very consistent).
  • Psychologists use vocabulary as a measure of intellectual curiosity and a person’s reading ability.

But also, it should be pointed out:

  • Politicians strive to get a clear, concise message in front of the public. That includes keeping it short and simple.

Other than Donald Trump, all presidents in this cohort were either career politicians, or in the case of Eisenhower, a very public figure and military leader for decades before running for president (historians argue whether a general at Eisenhower’s level would already be considered a politician before running for office, due to the need to navigate very political waters at that level).

Back to so what? In answer to those who emailed the equivalent of “is the president a stable genius”, the answer is “we don’t know.” Short of IQ tests, there’s no way to know for sure.

But what we can say is that, by every measure, his use of words when off script is significantly less diverse, and simpler, than that of the 14 presidents who preceded him, back to Herbert Hoover.

As always, feel free to dispute the analysis, but come prepared with data. We don’t need more opinions. But more analysis with supporting data is always welcome.

Here’s the data. Have fun!

[Note: Hmm… thought the plug-in would download all the tabs, not just one. Oh well. This is the Google Sheets link.]

No. | President | Flesch-Kincaid Grade Level | Flesch-Kincaid Reading Ease | Gunning Fog Index | Coleman-Liau Index | SMOG Index | Automated Readability Index | Dale Chall Readability Score | Average Words / Sentence | Average Syllables / Word | Word Count | Unique Words | Lexile Score (Low) | Lexile Score (High) | Lexile Word Count | Chart Label
45 | Donald Trump | 4.60 | 82.50 | 7.40 | 7.70 | 5.80 | 3.20 | 2.80 | 11.59 | 1.33 | 30,252 | 2,605 | 600 | 700 | 974 | Trump
33 | Harry S. Truman | 5.90 | 72.70 | 7.80 | 9.90 | 6.90 | 4.80 | 4.40 | 11.25 | 1.45 | 30,009 | 3,246 | 800 | 900 | 896 | Truman
41 | George H. W. Bush | 6.70 | 71.90 | 9.10 | 9.60 | 7.30 | 5.90 | 4.00 | 13.98 | 1.43 | 30,132 | 3,308 | 700 | 800 | 925 | Bush 41
43 | George W. Bush | 7.40 | 67.00 | 9.80 | 10.20 | 7.90 | 6.40 | 4.40 | 14.11 | 1.48 | 31,064 | 3,297 | 1000 | 1100 | 951 | Bush 43
32 | Franklin D. Roosevelt | 7.40 | 70.70 | 10.00 | 9.20 | 7.80 | 6.60 | 3.50 | 16.15 | 1.42 | 30,095 | 3,409 | 1100 | 1200 | 908 | FDR
36 | Lyndon B. Johnson | 7.60 | 67.70 | 10.20 | 10.00 | 8.20 | 6.90 | 4.20 | 15.43 | 1.46 | 30,036 | 3,191 | 1300 | 1400 | 975 | LBJ
40 | Ronald Reagan | 8.00 | 68.50 | 10.70 | 9.70 | 8.10 | 7.60 | 4.10 | 17.24 | 1.43 | 30,038 | 3,485 | 1100 | 1200 | 974 | Reagan
35 | John F. Kennedy | 8.80 | 63.70 | 11.20 | 10.10 | 9.00 | 8.30 | 4.60 | 17.94 | 1.48 | 30,003 | 3,289 | 1100 | 1200 | 953 | Kennedy
42 | Bill Clinton | 9.30 | 64.60 | 12.00 | 9.60 | 9.00 | 9.00 | 4.10 | 20.23 | 1.44 | 30,040 | 2,752 | 1100 | 1200 | 916 | Clinton
37 | Richard Nixon | 9.40 | 63.30 | 12.10 | 9.80 | 9.20 | 9.00 | 4.50 | 19.87 | 1.46 | 30,022 | 3,068 | 1200 | 1300 | 939 | Nixon
34 | Dwight D. Eisenhower | 9.40 | 64.10 | 12.20 | 9.60 | 9.00 | 9.20 | 4.20 | 20.57 | 1.44 | 30,005 | 3,269 | 1100 | 1200 | 945 | Eisenhower
38 | Gerald Ford | 9.40 | 60.80 | 11.80 | 10.30 | 9.40 | 8.70 | 4.70 | 18.52 | 1.50 | 30,092 | 3,287 | 1300 | 1400 | 923 | Ford
44 | Barack Obama | 9.70 | 58.10 | 12.10 | 11.20 | 9.50 | 9.30 | 5.00 | 18.23 | 1.54 | 30,006 | 3,869 | 1200 | 1300 | 979 | Obama
39 | Jimmy Carter | 10.70 | 54.60 | 13.30 | 11.40 | 10.20 | 10.40 | 5.20 | 20.22 | 1.56 | 30,070 | 3,624 | 1100 | 1200 | 973 | Carter
31 | Herbert Hoover | 11.30 | 51.90 | 14.40 | 11.40 | 10.90 | 11.00 | 4.90 | 21.38 | 1.57 | 30,170 | 3,471 | 1200 | 1300 | 912 | Hoover

The Howard Stern-Donald Trump Interviews

The Stern Thing

[Update: 9/27/17: Audio has been removed per DMCA notice from SiriusXM. Think it should be public? Feel free to let @SternShow and @SiriusXM know.]

[Update: 9/30/17: TrumpOnStern.com kindly pointed out we missed two Stern shows. Donald Trump appeared on November 9, 1995 for 22 minutes and January 20, 1994 for at least 8 minutes (the audio is not complete). The post below reflects the data without those two shows included.]

Be careful what you wish for. It could screw up your month.

So… the Howard Stern / Donald Trump interviews. It’s been a bit of an obsession of ours. But not for the reasons you might think.

There were some articles written before the election about Howard Stern, primarily by Andrew Kaczynski and Nate McDermott at Buzzfeed and later at CNN, Virginia Heffernan at Politico, David Fahrenthold at The Washington Post and others, including Mother Jones and The Atlantic.

These all quoted excerpts from these interviews. By our count, we found about 20 minutes of audio total covering about a dozen interviews.

If you’ve listened to Howard Stern before, you know you can find something salacious without a great deal of effort, and the interviews with Donald Trump were no exception.

However, the stories (with the exception of Heffernan’s excellent piece) didn’t address what we thought were two key points.

  1. Howard Stern is an excellent interviewer. Guests can spend two hours or longer speaking with Stern. His staff preps him well, and the interviews are impeccably researched, moving from making out with girls to port security in Dubai effortlessly. Howard Stern gets people to speak about things that, in any other context, they would never discuss.
  2. Based on our research, no one has spent more time interviewing Donald Trump publicly than Howard Stern, in terms of the length of the interviews, their number, and the period of time they span.

We wanted that record for our database. It’s a gaping hole.

But therein lies the problem. Howard Stern has done, conservatively, more than 8,000 shows since the 1980s, and that number is probably low. Based on the normal length, that’s at least 30,000 hours of audio and likely a minimum of 50,000,000 words. And there’s no definitive record. If Stern has the list, it hasn’t been shared.

We’ve found snippets and pieces before. But, per our mission, we want to ensure that anything in our database is the full transcript, versus an excerpt. As such, we were interested in the full record of conversations between Donald Trump and Howard Stern from the 1990s forward. To make sure we had it all, we wanted the whole show to check.

Our research indicated he was on the show dozens of times, but not the details, exact dates, etc. We reached out to people who operate fan sites, particularly marksfriggin.com, and on the Internet, particularly via Reddit. Stern fans are known for collecting recordings of old shows, so we were hoping to find the full recordings.

We were insulted in ways both creative and thorough, but kept trying. In short, we struck out. By the spring, we had shifted our focus to building out the features on the site.

And Suddenly…

Out of the blue, early in the morning on September 5th, about 3 1/2 months after we had moved on, we received an email with a Dropbox link from an anonymous Yahoo account. We looked and, to our surprise, it was several dozen MP3s, each with an entire show, end-to-end, which allowed us to verify we were capturing the entire interview. We copied the MP3s and quickly emailed back to ask a couple of clarifying questions. We were not-so-politely told to leave them alone.

Between the files and extensive research on marksfriggin.com and other sites, we were able to verify 35 unique interviews, beginning May 8, 1993 on Howard Stern’s E! interview show, through August 25, 2015. There were other MP3s, but they contained Stern talking about Trump, or a time when Trump was supposed to dial in but couldn’t, or in one case, a re-run. We filtered those out.

So we got to work, transcribing, proofreading, cross-checking. This is harder than it sounds. Our transcription robots are good. But the show is fast paced (235 words per minute by our measure) and filled with crosstalk, music, other sound effects in the background and noise. It mixes clean audio with phone audio. It’s the greatest hits list of “things that mess with algorithms.”

Combine that with a mixed bag of recording methods, and our robot was none too happy with us. So it involved a lot more manual work than we like.

The transcripts are complete but we’ll be working them towards perfection for some time. But they’re just about there, and married to the audio, and run through our usual battery of audio, text and voice analysis. (And please, when in doubt, listen to the audio).

But after investing more than two weeks, there’s just too much to do. We’ll keep tweaking in our spare time, but there’s only so many hours of the day before you start writing your blog posts at 3:45 am. Just sayin’.

Yeah Yeah Yeah. Whatcha Got?

Donald Trump’s time on Howard Stern totals 15 hours, 8 minutes and 52 seconds, with 104,357 words spoken by Donald Trump. This is 21% longer than his first book, “The Art of the Deal” (86,575 words). Hell, it’s almost half as long as the Frost / Nixon Interviews.

Based on our records, this is far more time than Trump has spent in interviews with any other journalist or media personality, including Morning Joe, Sean Hannity, Bill O’Reilly, Chris Matthews, Larry King, Don Imus… any of them. This is in terms of the number of interviews, the length, the time period.

Trump has spent far more time, over a far longer period of time, speaking in greater depth with Howard Stern than any other interviewer. No one has spent more time interviewing Donald Trump in a public setting than Howard Stern, and in particular spanning more than two decades. Having these interviews in our database provides a crucial perspective.

We stopped counting after more than 500 unique questions and answers. Yes, lots of questions about sex, positions, his views on women, and things you don’t find in any other interviews (AIDS, Chlamydia, group sex, groping in public… our robot keyworded a lot of new things… we chose not to teach our AI some things. It leads to scary things). But also, lots on North Korea, Iraq, infrastructure and taxation. The Port of Dubai security was a real question. And it was answered.

Some of the stories Trump told repeat themselves across multiple years. He discusses a great deal about his personal life. And most of the interviews had a specific hook: boxing matches Trump was promoting, new books, The Apprentice and, toward the end of the series, a great deal more about politics.

We also had to develop a custom taxonomy and classification. A good many of the questions and answers are, in Stern’s style, leading. For example, an oft-quoted excerpt from a 42-minute interview had the following segment:

Donald Trump: My daughter is beautiful, Ivanka. She…
Howard Stern: By the way, your daughter.
Donald Trump: She’s beautiful.
Howard Stern: Can I say this? A piece of ass.
Donald Trump: Yeah.

He didn’t say his daughter was “a piece of ass.” However, he did not argue the point.

This follows a pattern throughout the interviews of Stern making a statement as a question and Trump either confirming or denying the statement without repeating it. Trump first explicitly stated he wouldn’t answer a question on September 23, 2004, his 20th interview with Stern. As the interviews evolved closer to 2015, the rate of objections increased.

The interviews begin on May 8, 1993, before Tiffany and Barron were born; Eric was 9, Ivanka was 11 and Don Jr. was 15. He had just divorced his first wife, Ivana, and was dating Marla Maples. The last interview was on August 25, 2015, two months after he announced his 2016 presidential run. He and Melania had been married a decade, his children were married and he had starred in two famous television shows.

So Is This Everything?

We are almost sure we have them all. Daily records of Stern’s show prior to 1997 are difficult to find. Is it possible we missed one? Absolutely. But we’re pretty sure we’ve got them all. If we’re wrong, we’d love to know the dates and get to work transcribing.

Also, please check the audio. We think we did a good job tagging who is speaking. But when in doubt, hit play. And if we’re wrong, let us know so we can fix it.

You can find all the transcripts here:

Howard Stern – Donald Trump Interviews

They’re also in the general search, of course. The audio files can be found on SoundCloud, or you can download them all here.

9/27/17 – Audio has been removed per DMCA notice from SiriusXM. Think it should be public? Feel free to let @SternShow and @SiriusXM know.


What Makes Donald Trump Uncomfortable? A Statistical Analysis

So let’s get the headline out of the way. Donald Trump is not at all comfortable discussing God. That’s based on nearly three hours of video covering 424 distinct segments spanning more than 200 events.

That’s why you probably clicked here. Now, you get a data science explainer before you get the data. We’re so bait-and-switch.

As part of a set of new features we’re deploying (see our Emotion Subtitles), we generated a huge amount of data from our new approach to Voice Stress Analysis. Each second of audio and video gets individually analyzed, as well as 10-second segments, sequential segments, and the entire speech, interview or press conference.

This compilation opened up an interesting opportunity for analysis. Since our data is extensively tagged and structured, we could document, statistically, exactly what makes him relax, and what makes him tense. So we thought: cool.

A Word about Voice Stress Analysis

You’ll read a lot about voice stress analysis. So let’s address one thing here: it’s not a lie detector test. This is hotly debated, and we prefer to stick with the known. It has not been proven definitively that increases in voice stress indicate lies. If a person believes a lie, they will be relaxed. If a person steps on a tack, stress will increase even if telling the truth.

What this does definitely detect is a level of comfort, stress and/or anxiety. The higher the frequency (due to muscles contracting, including muscles in the neck that affect the voice box, thus the frequency), the greater the indication of stress. By measuring patterns when this occurs, we can identify statements and topics where a person is not comfortable with what they are saying. Coupled with identification of the underlying feelings and measurement of factors such as word choice and rate of speech, among several dozen others (we gather 115 datapoints per word), it’s a powerful way to uncover how someone feels about what they are saying.

It doesn’t tell you WHY they’re stressed or anxious. They just are. When used in an individual conversation, you don’t have context. The person can just be having a bad day. Or a great day.

That’s why the next part is important: we have hundreds of hours of Trump documented, transcribed and keyworded. A bad day is possible. 200 bad days on the same topic? Unlikely. In fact, we did a basic statistical model and found the odds of having “a bad day” on 200 or more unique days exactly when a particular topic was being discussed were… some big number. Excel showed one of those 1e12 things and we just moved on.

Back to Why You’re Here.

So we ran the data. The methodology is important, so we’ll explain it in detail:

  • Eliminate Bias. To remove bias, we selected only topics that Trump has discussed publicly 200 or more times, according to our database. Every one of those topics / subjects was checked and is reflected below.
  • Find Midpoint. For each interview, speech, event, and so on, an individual middle (median) point was established for just Trump’s voice. So if he was having a relaxed day, we measured when topics moved the stress above or below that midpoint. If he was having a bad day, same thing.
  • Phrase subjectivity. For phrases, we freely admit this was subjective. We checked our database for frequently used phrases and it found thousands. It’s a literal beast, so “I am going” appears in the list of three-word phrases. We punted and googled “Trump catch phrases” and selected about a dozen. We made a subjective choice to add “Make America Great Again” into the mix, as well as “Thanks”, “Thank You”, “God Bless You” and “God Bless America” into our checks, based on the findings in our topical analysis.
  • “You’re Fired.” We eliminated “You’re fired” since most of the references were short, pre-recorded clips from the television show vs. a real-world situation.
  • Short Segments. We eliminated any segment less than four seconds long, as that can add anomalous spikes, and we want the phrase or topic in context.
  • Sample size. This got us to 170.23 hours of video, spanning 30,899 unique segments (1- to 3-word sentences are a unit in our database based on size), from 1980 through this week, covering 1,634,208 words.
  • <nerd>This then fed into our algorithm, which is an Adaptive Empirical Mode Decomposition (AEMD) process, to check for deviations outside of 8-12Hz. This is widely recognized as the normal frequency range to monitor. When it goes above 12Hz, it’s considered stress…</nerd>
  • A reminder. Again, we use the midpoint from a particular event, to account for the fact that being President is probably stressful in and of itself.
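
The midpoint idea in the list above can be sketched in a few lines of Python. Everything named here (the Segment fields, the stress readings, the topic_deviation function) is hypothetical and for illustration only. The real pipeline's AEMD step and its scoring scale are not reproduced, and whether short segments are dropped before or after the midpoint is taken is our assumption.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class Segment:
    event_id: str   # which speech, interview or press conference
    topic: str      # tagged topic or phrase
    seconds: float  # segment length
    stress: float   # hypothetical per-segment stress reading


def topic_deviation(segments: list[Segment], topic: str, min_seconds: float = 4.0):
    """Average distance of a topic's segments from their own event's median stress."""
    # Drop short segments first (assumed order), as they can add anomalous spikes.
    usable = [s for s in segments if s.seconds >= min_seconds]

    # Midpoint per event: the median stress of that event's usable segments.
    by_event: dict[str, list[float]] = {}
    for s in usable:
        by_event.setdefault(s.event_id, []).append(s.stress)
    midpoints = {event: median(values) for event, values in by_event.items()}

    # Positive = more stressed than that day's baseline, negative = more relaxed.
    deviations = [s.stress - midpoints[s.event_id] for s in usable if s.topic == topic]
    return mean(deviations) if deviations else None


demo_segments = [
    Segment("rally", "jobs", 5.0, 10.0),
    Segment("rally", "god", 6.0, 12.0),
    Segment("rally", "trade", 5.0, 11.0),
    Segment("rally", "god", 2.0, 20.0),   # under four seconds, ignored
    Segment("interview", "jobs", 5.0, 8.0),
    Segment("interview", "god", 5.0, 9.0),
    Segment("interview", "trade", 5.0, 7.0),
]
god_score = topic_deviation(demo_segments, "god")
```

Measuring against each event's own median is what lets a relaxed day and a tense day be compared on the same footing.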

One note: you will see topics on the table below with fewer than 200 citations. Our check of topics included print interviews, his writings and tweets, indicating it is a topic he frequently discusses, but it may be represented fewer than 200 times in the audio and video.

And from that data…

Back to the Lede

Trump is clearly, statistically, uncomfortable expressing gratitude. When he thanks people, based on 67 unique segments where thanking someone was the topic, and another 105 phrase references to thanking someone, he is consistently at an elevated stress level, indicating anxiety.

Similarly, when discussing God as a topic (424 unique segments), he is also uncomfortable, with his voice indicating stress and anxiety well above the midpoint established contextually in the conversation. Note this is specific to discussing God, vs any particular religion or religion itself.

Rounding out the top list of uncomfortable topics and phrases:

  • “Make America Great Again” (32 segments)
  • “Build the Wall” / “Build That Wall” (153 segments)
  • The White House (as an institution – 323 segments)
  • Veterans (402 segments)
  • Law Enforcement (194 segments)
  • The Wall (as a topic – 790 segments)

Okay, but what puts him at ease? On what topics is he comfortable?

The top of the list is what our system classified as “inner cities” but in looking at specific references, it’s discussions of urban planning, cities and infrastructure. He’s well below the stress midpoint when on this topic (145 references). A good number of these references were in interviews pre-dating his Presidency as well.

The Middle East is strongly represented on the list of topics where he is comfortable: Iraq (420 references), Iran (406 references), Syria (281 references) and the Middle East in general (305 references) are all topics he is clearly relaxed and not anxious discussing.

Rounding out this list of topics where he is comfortable:

  • War (248 references)
  • The New York Times (89 references)
  • Terrorism (366 references)
  • “A lot of money” (275 references)
  • “Many many” (126 references)

So was there anything else surprising?

Personally, for me, there were a few things, but the world doesn’t need another opinion right now, so take a look at the data below and decide for yourself. If you disagree with anything in the methodology, let us know. But be warned: we make available all our data on request, and will continue to do so. If you disagree with the points above, we’re happy to send you the algo and all the underlying data for you to verify the results for yourself, or to run through a different process. The world not needing another opinion doesn’t just apply to me :-). We’re all about data and verifiable facts at Factba.se, so you’re welcome to think we’re wrong, but be ready for us to challenge you to prove we’re wrong.




Topic / Phrase Deviation Score # of Segments Length of Segments [HH:MM:SS]
Phrase: “God Bless You”, “God Bless America” 1.4990 159 01:03:30
Phrase: “Thanks”, “Thank you” 1.3732 105 00:37:33
Phrase: “Make America Great Again” 1.1865 32 00:03:45
Thanks / Thanking Someone 0.9275 67 00:30:36
God 0.7604 424 02:54:00
Phrase: “Build the Wall” / “Build that Wall” 0.6085 153 00:41:48
The White House 0.6006 323 02:09:06
Veterans 0.5056 402 02:17:28
Law Enforcement 0.4312 194 01:31:37
The Wall 0.3307 790 02:57:00
Education 0.2488 371 01:40:01
Illegal Immigration 0.2254 109 00:46:39
North Carolina 0.2074 228 01:17:23
Phrase: “Believe Me” 0.2056 834 04:38:59
Phrase: “Sad”, “So Sad” 0.1921 456 02:52:36
Obamacare 0.1822 800 04:36:54
Senate 0.1538 182 01:07:52
United States 0.1529 3128 21:07:33
Congress 0.1500 297 02:15:39
Records 0.1378 181 01:03:20
Washington 0.1240 420 02:41:43
Iowa 0.1189 324 01:19:37
Phrase: “Winning” 0.1117 444 01:43:23
Israel 0.1080 186 01:24:48
Donald Trump 0.0908 573 02:20:56
Polls 0.0792 388 01:36:50
ISIS 0.0757 727 03:51:00
North Korea 0.0650 122 00:49:56
American People 0.0545 275 02:12:24
Campaign 0.0401 525 03:06:03
Health Care 0.0353 220 01:27:06
Florida 0.0251 412 01:45:09
Mexico 0.0114 1102 04:21:13
Special Interests 0.0086 183 01:28:37
Security -0.0107 313 02:14:26
Russia -0.0290 272 01:33:35
Politicians -0.0354 651 02:55:25
Democrats -0.0367 392 02:20:09
New Hampshire -0.0513 308 01:14:53
Phrase: “Tremendous” -0.0525 1099 06:09:35
Trump Administration -0.0567 692 05:02:17
Law -0.0577 211 01:18:52
China -0.0674 1359 05:21:54
Phrase: “Huge” -0.0723 112 00:34:09
New York -0.0747 440 02:00:39
Media -0.0759 373 02:18:02
Drugs -0.0771 385 01:59:19
Hillary Clinton -0.0971 2690 15:17:47
NAFTA -0.1140 320 01:51:19
Republicans -0.1184 454 02:12:58
Numbers -0.1231 483 02:18:55
Trade -0.1232 693 02:56:48
Ohio -0.1351 332 01:50:52
Border -0.1359 1094 05:55:57
Jobs -0.1366 2158 12:08:53
Barack Obama -0.1440 1231 07:09:13
Japan -0.1742 435 01:25:26
Future -0.1871 299 02:20:56
Phrase: “Many Many” -0.1990 126 00:16:37
Phrase: “A Lot of Money” -0.2124 275 01:16:13
Terrorism -0.2229 366 02:44:03
Middle East -0.2260 305 02:02:45
Syria -0.2381 281 01:47:01
The New York Times -0.2714 89 00:36:41
War -0.2890 248 01:33:08
Iran -0.2923 406 02:12:54
Iraq -0.4850 420 02:09:44
Infrastructure / City Planning -0.6432 145 01:09:01

Trump / Putin Meeting: Who’s Frustrated? Who’s in Control? Ask the Data

Short version

Our robot’s audio analysis of Trump from his two minutes on camera with Putin:  “Disappointed, feeling of missed opportunity. Cold and remote. Pompous, overexcited, empowered.”

Our robot’s audio analysis of Putin from his two minutes on camera with Trump: “Feels in control. Controlled speaking.”

Questions? I hope so. Explanation below. It’s worth it. If not, skip to the Infographic.

Long Version

We’re constantly bringing new processes, techniques and tools online at FactSquared. We’ve been using machine learning for analyzing audio, video and text since we launched (all the way back in… January!)

But we’re also very agnostic about tools. We don’t have all the answers, and we watch this space closely for new developments. When something new comes along, we try it. If it adds value, we integrate into our composite.

Oh people can come up with statistics to prove anything Kent. Forty percent of all people know that.
— Homer Simpson
The Simpsons, S05E11

We were going through a round of testing on a new approach right when Donald Trump met with Vladimir Putin in Hamburg, Germany on Friday, July 7, 2017.

In a peanut-butter-meets-chocolate moment, we said: “let’s try this out!”

In keeping with the past blog posts, you get some background and details. It’s like Neil deGrasse Tyson, but not as funny. Or smart. Or handsome. Or charismatic…

…back from the therapist. All better. Picking it back up…

A huge part of what we do, separate from pulling all this data together, is the analysis. Most of it is behind the scenes because it’s a lot of data. 115 datapoints per word. Or, in the average 10-minute speech (1,132 words, at current 30-day moving average of Trump’s speeches and remarks of 113.2 words per minute), 130,180 datapoints. You do not want all that on a page.
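
If you want to check the arithmetic, here it is as a tiny Python sketch. The constants come straight from the paragraph above; the function name is just ours.

```python
DATAPOINTS_PER_WORD = 115
WORDS_PER_MINUTE = 113.2  # 30-day moving average quoted above


def datapoints_for_speech(minutes: float) -> tuple[int, int]:
    """Return (estimated words, estimated datapoints) for a speech of this length."""
    words = round(minutes * WORDS_PER_MINUTE)
    return words, words * DATAPOINTS_PER_WORD


words, datapoints = datapoints_for_speech(10)
print(f"{words:,} words -> {datapoints:,} datapoints")
```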

It all feeds our search engine to make the results hyper-accurate, but the goal has always been a way to surface the information that doesn’t overwhelm. You’ll start seeing some of it in the next few days as we get charts and dashboards on the search, and in our daily newsletter (yes, it’s coming).

Text Analysis

Part of all this is text analytics of course. Using established approaches and methodologies, it analyzes the words and groups of words to score how positive or negative a statement is, what emotions it conveys, the topics of conversation, and so on.

For example, when we analyze word usage to determine odd turns of phrase, or how “normal” a statement is in terms of language, the system draws on the Corpus of Contemporary American English, a statistical compilation of 520 million words across books, newspapers, magazines and spoken words from 1990 – 2015 (it’s cool, but bring Dramamine). The raw data we generate is reproducible.
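
To make “how normal is this statement” a little more concrete, here’s a minimal sketch of one common way to score it: average surprisal against reference word counts. This is an illustration, not our production pipeline. The function names are hypothetical and the tiny count table is made up for the example; real scoring would load frequencies from a full reference corpus.

```python
import math
import re

# Made-up toy counts standing in for a real reference corpus (assumption).
TOY_COUNTS = {"the": 1000, "a": 800, "very": 200, "good": 100, "deal": 50}

def surprisal(word, counts, total):
    """Bits of surprise for one word, with add-one smoothing for unseen words."""
    probability = (counts.get(word, 0) + 1) / (total + len(counts) + 1)
    return -math.log2(probability)

def oddness(statement, counts=TOY_COUNTS):
    """Average surprisal per word: higher means less 'normal' language."""
    words = re.findall(r"[a-z']+", statement.lower())
    if not words:
        return 0.0
    total = sum(counts.values())
    return sum(surprisal(w, counts, total) for w in words) / len(words)

# A statement containing a word the reference counts have never seen
# should score as odder than one made of common words.
print(oddness("a very good deal"))
print(oddness("a bigly good deal"))
```

The averaging keeps long statements from looking odd just because they are long; what moves the score is how rare the individual words are.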

Audio Analysis

The same principle applies to audio analysis, which measures voice stress reliably and compares the frequency, tremors and other tics against things like the Toronto Emotional Speech Set or the Berlin Database of Emotional Speech, and quite a few others. From there, we tailor the models, building on top of the core data. Ditto for video. You get the point.

Taking all the above, for example, our current system generated a composite of the Trump / Putin discussion. It described Trump and Putin as very positive. This was of course challenging as the text analysis was off the translator for Putin’s comments. The text emotion reflected “Joy” and “Agreeable” for both.

But, this provides an analysis of the words, not the person.

The audio analysis, which is important, told a different story. It characterized Trump as being moderately positive, but low energy. Putin was characterized at the midline: neither high nor low energy, neither positive nor negative. Put another way, the robot said Trump was upbeat in tone, happy, but lower energy. The same robot said Putin was a cipher. Neutral across the board.

Something Old, Something New

That brings us current. We’ve been meaning to test an expanded voice analysis tool. The company, BeyondVerbal, had built their analysis off of more than 60,000 samples, far larger than the others. Analysis tools such as these are the embodiment of “more is more.” The much larger sample set lets the analysis be much more finely sliced. So we took it out for a spin.

We used the below for Trump…

…and Putin.

Because the tool specifically measures voice frequencies, the camera noise should not impact it. That being said, we tested it anyway after removing the camera noise…

…and found the analysis, within 1-2%, to be nearly identical. Feel free to test on your own to validate.

Our findings are below in the table. The data indicated a more restrained, less confident Trump, while Putin appeared to be in tight control of his voice and more confident of his position. The data, combined with several dozen other tests, also proved to be an improvement on our audio analysis. So we’ll be integrating it into our composite in the coming days.

Audio Analysis


Is Trump Going Senile? (Beta)

No, you didn’t catch us opinion-ating.

We’re getting ready to debut a new daily feature that will try to use data to validate assertions, or to uncover insights that would be impossible without the data collection at Factba.se.

It’s not ready to go just yet, and it’s not as pretty as we want it. But the data’s more important than the prettiness.

To that end, please see our first try at this, and let us know what you think. This was based on an article in May in Stat that asserted Trump was potentially going through cognitive decline. The term “senile” was latched on to by the press (and thus this infographic), but it was not used in the article, just the comments. Much of the data cited was due to speaking style and vocabulary.

Well, said we. We have a definitive record spanning 37 years. Let’s take a look.

We focused on two areas: the Flesch-Kincaid Reading Level, which basically scores sentence length and word complexity, and rate of speech. It’s worth noting that while the average American speaks at 135-160 words per minute, the average New Yorker is close to 200 words per minute.
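
For the nerds, both measures are easy to sketch. Below is a minimal Python version using the standard Flesch-Kincaid Grade Level formula and a plain words-per-minute calculation. Treat it as an illustration rather than our exact pipeline: the function names are hypothetical, and the syllable counter is a crude vowel-group estimate (an assumption on our part; real tools use pronunciation dictionaries).

```python
import re

def count_syllables(word: str) -> int:
    """Very rough syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fk_grade(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula, from raw counts."""
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)

def fk_grade_for_text(text: str) -> float:
    """Grade level for a block of text, using naive sentence and word splits."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return fk_grade(len(words), len(sentences), syllables)

def words_per_minute(word_count: int, seconds: float) -> float:
    """Rate of speech: words spoken divided by elapsed minutes."""
    return word_count / (seconds / 60.0)

print(fk_grade_for_text("The cat sat. The dog ran."))
print(words_per_minute(1132, 600))
```

Longer sentences and longer words both push the grade level up, which is why scripted speeches and off-the-cuff interviews can land so far apart.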

We are not offering opinions as to why, but what we can say definitively:

  • Overall, his rate of speech has dropped consistently;
  • This has coincided with a decrease in his unscripted public statements (e.g. interviews) and an increase in his scripted public appearances (speeches, remarks);
  • There is a statistically significant difference in his rate of speech when looking at the type of appearance. In interviews and debates, he speaks much faster. In remarks and speeches, much slower (almost half speed);
  • The complexity of words used in speeches is almost double the grade level of those used in testimony and debates, which are less likely to be scripted.

Statistically, the press conference is a bit of an outlier, though Jennifer (yes, Jennifer) pointed out that he often begins these with a prepared statement, which is included in the vocabulary and the rate of speech, which may skew the results.

That said, let us know what you think.

[Correction 7/1/17: Note the original infographic incorrectly compressed the X axis in terms of years. This does not change or alter the data, but does affect the two timeline charts in the appearance of the data. It has been updated.]

Sigh… Thanks for the Weekend

Life’s little pleasures on a Father’s Day weekend:

  • Cleaning part of the house… just… right.
  • Watching a Pixar flick with your kids.
  • Dunkin’ Donuts without any guilt
  • Getting a 98-page PDF, in tabular format, dropped on your lap at 5:30 pm on a Friday with a year’s worth of financial data. (PDF Here)

Well, I guess we signed up for this.

We worked through this at the beginning of the year. We luckily had three things going for us:

  1. Semi-consistent numbering on the OGE 278e financial forms
  2. Two previous years of clean data to compare against the new one
  3. A few handy PDF extraction tools that, while far from perfect, are pretty good at pulling the data out in a non-crappy format.

So, that said, still about 10 hours. But the bright side of being hands on is… you learn a lot. For example:

  • Ownership. Basically, anything previously with an owner of “Donald J. Trump” is now shifted to one of the following:
    • DJT Holdings LLC
    • DJT Holdings Managing Member LLC
    • DTTM Operations LLC
    • DTTM Operations Managing Member LLC
    • … or the Donald J Trump Revocable Trust
  • It’s worth noting that the four LLCs mentioned above are all owned by the Donald J Trump Revocable Trust
  • As part of moving around assets, a checking and savings account in excess of $50,000,000 was opened at Capital One on April 12, 2017 for the Donald J. Trump Revocable Trust
  • A bookkeeping thing. The companies listed in his resignation letter from January 19 and the list of resignations in the OGE 278e don’t match or line up neatly. Someone should poke around just to make sure I’s dotted, t’s crossed.

Everything is integrated into our Assets page (https://factba.se/topic/assets).

In addition, we put everything in two Spreadsheets, because nobody should deal with PDFs. We feel strongly about that (with apologies to Adobe).

  • The OGE 278e Financial Disclosure from June 14, 2017 is completely converted to a spreadsheet here: https://goo.gl/4jL9Bo
    It’s embedded below, but save yourself the headache and go straight to the sheet.
  • We put his income, liabilities and portfolio side by side for 2015, 2016, and 2017 in a spreadsheet here: https://goo.gl/MdbrhC

Everything above is licensed under Creative Commons Attribution 3.0. Put it to good use and crunch away.

Feed Me (Transcripts), Seymour…

If there’s one thing about statistical models that’s generally true: they need to be fed.

For about six months now, I’ve been living most waking moments in the words of Donald Trump. I love algorithms, but I check them. And check them again. And again. It’s not even borderline compulsive. We blew past borderline around January. It is compulsive.

Probably the single biggest challenge I face in shaping the models: access to raw materials. We check, and check again, every word. Yes, he was on Oprah in 1988, but we need more than 3:11… we need the whole show for context.

We are constantly updating our backlog of material, with volunteers generously sending in links (I’m looking at you CJ in particular), text, videos that in turn need to be checked and then fed into Margaret, our pseudo-AI that ravenously consumes every word spoken, analyzing the audio, video and text to build her model. This in turn analyzes tweets, transcribes better, and does lots of other cool things.

The single best source of this information is interviews. As opposed to speeches, they are generally unscripted. As opposed to tweets, you get more than 19.6 words at a time (1 year moving average, 3,203 tweets, 62,871 words). Sometime, I’ll have enough time to do a separate post explaining how differently the models view speeches vs. interviews… it’s almost two different people in the output.

However, as Chris Cillizza at CNN pointed out in a recent tweet, these are often unshared, even after the news cycle. Some organizations publish transcripts simultaneously. Most publish just excerpts, noting they’ve been edited. Some share audio and video, but with cuts and jumps. Others… nothing.

I’m not naming names, but given that the messages coming from The White House can at times appear to contradict each other, this raw material is crucial, both for the historical record, and for building a base of research that others can analyze.

Also, the full, unedited interview can remove potential questions as to whether comments are in context. Personally, I think that is in nearly every case a ridiculous argument, but the argument can’t be made if there are no edits.

So I’m making both a public plea, and an offer: please, in the name of all that is good in the world, once you’ve run your stories and pieces, please publish and share the raw materials. Pull any off-the-record comments, but otherwise, share the raw audio, video and text.

Since everyone has a few things to do nowadays, here’s what we’ll offer for any interview with the President, if time or resources constrain a full transcript or sharing raw video and/or audio.

  1. Factba.se will happily, and freely, transcribe in full any video or audio provided, both via Margaret, and with a human editor to verify.
  2. Factba.se will provide, via a spreadsheet or any other medium, ALL metadata developed. This is the stuff that is behind the scenes (not for long) on our site. If audio, you’ll get back second-by-second audio analysis of voice stress and emotion, which is keyed to Trump (the sotto voce whisper). If video, it will include facial expressions, gestures and other analysis (clothing identification, colors, smile / frown, the two-handed punctuation I myself have as a third-generation bridge-and-tunnel child, etc). It is even learning to pick up when he flushes (complexion change). It will be a lot. But it will be everything.
  3. We will provide the full keyword and entity extraction, by three-sentence pair, section and overall, both for the entire interview, and specifically on just when Trump is speaking.
  4. We will provide the full range of analysis. Grade-level models, sentiment, emotion… all of it.
  5. We will respect any and all embargoes given. We are not meant to be a news organization. If you’d like us to hold until a day, two days, three days after the stories run before integrating and sharing the information, fine. You’re the boss. It’s your interview. You get it back first and control the story.
  6. If a human is in the mix editing, figure two hours per hour of video/audio for transcript. If you don’t mind raw from Margaret (she’s close to 95% dead on now), 90 seconds per hour. We just need a little notice to plan our day to be ready for it if you want a quick turnaround.
  7. We will, of course, link out to your pieces from the text.
  8. If there are any other requests… fine. Our interest is the record, and sharing the resulting analysis.

We’re not looking to create a hippie commune. We are looking, however, to unleash the data that is contained in your excellent work, in a way that does not conflict with your job.

Also, on the off chance Margaret becomes sentient again, you’ll be in her good graces.

— Bill Frischling

<nerd>When Occam’s Razor Cuts You</nerd>

The past couple of weeks have been a fun on-again, off-again exercise in fixing a “legacy” problem, if that term can be applied to a four-month-old site: IDing when Trump himself uses @realdonaldtrump, vs his staff.

Trumpologists have known for a while that he used a Samsung Galaxy S3, aka an Android phone. His staff all used iPhones. Further, multiple analyses of the language style, time of day, etc. (nerd out here and here) validated the connection between the Android and tweets from the thumbs of Trump.

But things change. His Android phone was last seen the morning of March 8th:

Then… nothing. Android showed up briefly for two tweets on March 25th within four minutes of each other (I’m picturing a brief wrestling match with the Secret Service as he pulled the phone from its hiding spot in the limo on the way to Trump National in Potomac Falls, VA… the two tweets were during his ride over, 20 minutes before arrival… maybe while on the Toll Road?) and disappeared again.

Meanwhile, there were 139 other tweets, mostly from the iPhone. Including lots of FAKE NEWS references and other tweets that are universally agreed to have come from him.

So, here in the Fact Cave, we’ve had an algorithm for months that looked at his text. The trouble: it essentially biased the heck out of Android as an indicator. When the Android disappeared, the algo rolled over, showed its belly and promptly failed miserably. Back to the drawing board.

So we got to work. And iterated, tested, iterated and iterated some more.

Meanwhile, Andrew McGill at The Atlantic remembered the golden rule we forgot: Always. Be. Shipping. His excellent code can be found here.

We agreed with most of his approach, but now properly motivated, we blew the dust off perfection, hunkered down and reclassified all the tweets from March 8th forward.

The logic? We kept it simple:

  • Control. In 2016, removing retweets, the Android phone tweeted 1,357 times. Other devices tweeted 2,264 times. For our purposes, this was treated as gospel, where Android = Trump, not Android = staff
  • Words. Trump is very distinctive. We generated a deep count of commonly used words on both accounts. Simple word frequency
  • Hashtags. Trump almost never uses hashtags. His staff uses them frequently. Appearance of hashtags biases heavily to Staff
  • URLs and Photos. Outside of retweets, Trump used either a URL or photo a total of 10 times out of 1,357 tweets. Another heavy bias
  • Others. What we tested and ignored: sentiment (Trump tweets bias negative, but not consistently enough to be a factor), user mentions, use of capitalization, use of exclamation points. None of these were as clear as hashtags and URLs.
  • New platforms. Twitter Ads and Media Studio, both social platforms / products unlikely to be in Trump’s hands on his phone, are automatically classified as staff.
  • Then we guessed…

… well, not really. We took those three factors (for each we did a log odds ratio) and threw them into a test that automatically adjusted the weighting and scores for each of the factors and compared the result against a random sample of the control (1,000 items) until it settled on the best outcome.
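
If you want to play along at home, here’s a minimal sketch of that idea in Python: a smoothed log odds ratio per factor (words, hashtags, URLs), a weighted sum, and a brute-force search over weights. It is a toy under stated assumptions, not our production code. Every function name is hypothetical, the weight grid is arbitrary, and the four “tweets” are made-up strings, not real data.

```python
import math
import re
from itertools import product

def features(text):
    """Split a tweet into the three factors: words, hashtag flag, URL flag."""
    stripped = re.sub(r"https?://\S+|#\w+", " ", text.lower())
    return {
        "words": set(re.findall(r"[a-z']+", stripped)),
        "hashtag": "#" in text,
        "url": "http" in text.lower(),
    }

def log_odds(trump_hits, trump_total, staff_hits, staff_total):
    """Smoothed log odds ratio; positive leans Trump, negative leans staff."""
    p_trump = (trump_hits + 1) / (trump_total + 2)
    p_staff = (staff_hits + 1) / (staff_total + 2)
    return math.log(p_trump / (1 - p_trump)) - math.log(p_staff / (1 - p_staff))

def train(control):
    """control: list of (text, label) pairs; label is 'trump' or 'staff'."""
    totals = {"trump": 0, "staff": 0}
    counts = {"trump": {}, "staff": {}}
    for text, label in control:
        totals[label] += 1
        f = features(text)
        keys = {("word", w) for w in f["words"]}
        if f["hashtag"]:
            keys.add(("hashtag", True))
        if f["url"]:
            keys.add(("url", True))
        for key in keys:
            counts[label][key] = counts[label].get(key, 0) + 1
    all_keys = set(counts["trump"]) | set(counts["staff"])
    return {k: log_odds(counts["trump"].get(k, 0), totals["trump"],
                        counts["staff"].get(k, 0), totals["staff"])
            for k in all_keys}

def score(text, table, weights):
    """Weighted sum of the three factor scores for one tweet."""
    f = features(text)
    word_scores = [table.get(("word", w), 0.0) for w in f["words"]]
    word_part = sum(word_scores) / len(word_scores) if word_scores else 0.0
    tag_part = table.get(("hashtag", True), 0.0) if f["hashtag"] else 0.0
    url_part = table.get(("url", True), 0.0) if f["url"] else 0.0
    w_words, w_tag, w_url = weights
    return w_words * word_part + w_tag * tag_part + w_url * url_part

def classify(text, table, weights):
    return "trump" if score(text, table, weights) > 0 else "staff"

def tune(sample, table, grid=(0.5, 1.0, 2.0)):
    """Try every weight combination and keep the most accurate one."""
    def accuracy(weights):
        return sum(classify(t, table, weights) == y for t, y in sample) / len(sample)
    return max(product(grid, repeat=3), key=accuracy)

# Made-up toy strings standing in for the labeled control set (assumption).
control = [
    ("The media is so dishonest. Sad!", "trump"),
    ("The dishonest media will never report this. Sad!", "trump"),
    ("Join us tonight in Ohio! #MAGA https://example.com/rally", "staff"),
    ("Thank you Florida! #MAGA https://example.com/photo", "staff"),
]
table = train(control)
weights = tune(control, table)
print(classify("So dishonest. Sad!", table, weights))
print(classify("Join us in Iowa #MAGA", table, weights))
```

With a real control set you would tune on a random sample and measure accuracy on tweets the tuner never saw, so the numbers aren’t flattered by the training data.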

The best result we could get without relying on the device as a major indicator? The robot can correctly identify Trump tweets 91% of the time, and staff tweets 85% of the time.

Perfect? No. Better than nothing? Yes.

We’re going to keep working on it and getting that number up. And we’ll be faster about it next time.


Factba.se 2.0. Now with 0.2 More!

Okay, been focusing on quite a few things, but we just pushed out a fairly large update, and we’ve got some news as well. But first, the updates from today and the past two weeks:

  • Full Access to transcripts. We’ve been asked about this repeatedly. Now, you can browse through everything Trump has said that we have in the system in a handy timeline here: https://factba.se/transcripts. In addition, it surfaces some of the behind-the-scenes analytics, like emotion analysis, sentiment analysis, keywords, entities and more. Just click on an item to see a detailed breakout. (for example: https://factba.se/transcript/donald-trump-remarks-greek-ceo-march-24-2017).
  • White House Schedule. A simple little doodad. It lists the President’s Schedule (public schedule), broken out as appointments. As analysis comes in, it is linked into the schedule. It’s also available in JSON, CSV, and of course iCal format, as well as in a public Google calendar. https://factba.se/topic/calendar
  • iOS App. This grew out of the consolidated White House feed we did, so everyone can monitor all the White House’s social feeds, website and email list to the press in one spot. We were asked for realtime alerts. Then we thought about an app. Then I said “Hey, how hard can it be to learn to code an iOS app?” Seven days later, with about four hours of sleep total and three cases of Diet Coke, the keyword-friendly-named “Trump White House Consolidated News Release Feed” app was born. A whopping $0.99, which after using for a year, means we lose money on the push alert costs. But it needed to be done. http://apple.co/2nEVN7Y

Whew, that’s quite a bit for an update. One more piece of news…

Open Data Access. After a fair amount of discussion, we’ve decided to pursue freely distributing the entire Trump dataset via APIs. This will provide data access to:

  • Complete Transcript Library (3MM+ words) + Meta Data
  • The live Trump Twitter Archive
  • The complete screenshot library of his @realdonaldtrump feed
  • Financial records in data form and mapped to company holdings
  • H1B Filings
  • Court Records

We’ve already started doing that with our live feeds and calendars. Anyone who wants to data mine or come up with new ways of using the data will be free to do so.

We need to get the infrastructure in place, and that may take a couple of weeks, but we’ll have managed public APIs that let you get some, or all, of the data for public use, on the condition the work product is shared publicly as well.

The live White House feed is available freely now as:

The President’s Schedule is available similarly as:

That’s enough of an update for today. Onward.

Factba.se v1.8 – We’re almost to v2

We’ll get to the v2 release. But first, a couple of notes:

1. We’ve been a bit overwhelmed by the requests to assist — voluntarily — with the site and information collection. If we’ve been slow to follow up, see #3 below.

2. We pushed live an internal tool that we think would be useful. One of our unexpected challenges was the lack of a centralized place to gather new information. Updates can appear on Twitter, Facebook, YouTube, the White House site. To that end, we centralized a realtime feed that pulls together:

  • Whitehouse.gov
  • Facebook (DonaldTrump, POTUS, Whitehouse)
  • Instagram (Whitehouse)
  • YouTube (WhiteHouse)
  • Twitter (realDonaldTrump, WhiteHouse, POTUS, VP, Mike_Pence, SeanSpicer)
  • The White House press distribution list (immediate release only)

…and puts it here: https://factba.se/topic/latest . This is the same feed we use to monitor and add all new public statements. The social feeds update realtime. The White House and email are every 60 seconds. You can also plug it in to RSS, hit the JSON directly, or follow it live on the robotic @FactbaseFeed

If you see a source missing, just let us know. As near as we can tell, it’s the only source that monitors everything coming out of the WHPO from all sources.

3. A bit of personal news. Since the election, Factba.se has become an increasing focus of my life in particular. To that end, I left my day job last week as Vice President / Entrepreneur-in-Residence at U.S. News and World Report to dedicate more time and focus on the platform and content. Based on the traffic, it’s getting regular, repeat use in newsrooms. We hope to become only more valuable as time goes on. This includes getting back to the folks in regard to what we need (mostly: tracking down video for documents).

And if you know a good project manager who could take some time to yell at me daily to stop obsessing on minor details, please send them my way… or just randomly call and yell at me. The 120 Jira tickets aren’t going down fast enough :-).