The dictatorship of consortia
on the media they own
CS-401 Applied data analysis project 2021
One's view of the world is based on the information one has about it, and on how that information is conveyed. We obtain a large part of it through mainstream media, in particular newspapers such as The Guardian or Fox News. As a result, the media have a huge impact on public opinion and can drive its polarization. One way to make sure you are well-informed (i.e. you get information on various topics) and not biased towards a particular viewpoint is to gather knowledge from diverse sources (e.g. several media). However, many media outlets are owned by the same groups and might publish converging opinions. This raises the question of whether the information we get really is diverse, or whether the news and opinions published are roughly the same across media belonging to the same entity.
The following video is evidence of such a bias for the Sinclair Broadcast Group, and we will try to obtain similar observations for more groups.
Are the quotes from news articles sufficient to cluster newspapers by their owners?
We aim to show that it is possible to cluster newspapers by the media group they belong to, using only the quotes they cite and the people who spoke them. If we succeed, we will have shown that a diversity of journals does not necessarily imply a diversity of information sources: if you want to be well-informed, simply reading different newspapers is not enough.
A sketch of our journey:
We first need some features which may allow us to distinguish between different media. To do so, we use the Quotebank dataset, a corpus of speaker-attributed quotations extracted from diverse English news articles over the period 2008 to 2020. Due to the sheer scale of the data and the great variation of topics over the years, we choose to focus our study on 2020. The resulting dataset still contains over 5'000'000 unique quotes and 7'000 newspapers: a normal human wouldn't be able to comprehend all this data without resorting to some ADA techniques.
Since this dataset does not contain all the data we need –such as the media group owners, or the country of origin of the journals– we augment it using an open knowledge base, Wikidata.
Then, we use NLP techniques to construct TF-IDF matrices –a better representation of our data, which we detail later– from these features.
Finally, we project our results –again, more details later– in a 2D-space to visualize and analyze them.
A note before proceeding:
For this whole blogpost, we will interchangeably use the terms newspaper/media/journal, and quotee/citee.
Let's dive into the data
The data we get from the Quotebank dataset looks like this:
data.head()
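For reference, here is a minimal sketch of how such a dataframe might be built from the raw dump. The filename is an assumption, and the column selection is illustrative; the actual Quotebank files are distributed as bz2-compressed, line-delimited JSON.

```python
import pandas as pd

# Hypothetical local path to the 2020 Quotebank dump: one quotation per line,
# bz2-compressed, line-delimited JSON.
chunks = pd.read_json("quotes-2020.json.bz2", lines=True,
                      compression="bz2", chunksize=1_000_000)

# Keep only the fields we actually use later (ID, quote text, speaker, URLs).
data = pd.concat(chunk[["quoteID", "quotation", "speaker", "urls"]]
                 for chunk in chunks)
data.head()
```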
For now, our only way to identify in which newspaper a quote was published is the URL where the quote was discovered. To be able to characterize clusters of newspapers, we need to extend our dataset with the newspaper's entity, as well as the media group owning it. Because we expect to see differences between media based on their country of origin, we also add the home country of each media. Using our best scraping skills, we query Wikidata to add those features. This comes at a cost: not all newspapers have a Wikidata entry, and of those that do, not all have a media group indicated. As a consequence, the reduced dataset contains around 1'000 newspapers, which reduces the diversity of our data in exchange for more insight on each data point. Newspapers that are the only member of their group were also removed, since they are irrelevant to our goal of clustering newspapers by their parent group.
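As an illustration, here is a hedged sketch of the kind of Wikidata lookup involved, via the public SPARQL endpoint. It relies on the properties P127 ('owned by') and P17 ('country'); the helper name is ours, and resolving a newspaper's QID from its article domain (the harder step) is not shown.

```python
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def owner_and_country(qid):
    """Fetch the owner (P127) and country (P17) labels of a Wikidata item.

    `qid` is the newspaper's Wikidata identifier; mapping the article URL's
    domain to a QID is assumed to have been done beforehand.
    """
    query = f"""
    SELECT ?ownerLabel ?countryLabel WHERE {{
      OPTIONAL {{ wd:{qid} wdt:P127 ?owner . }}
      OPTIONAL {{ wd:{qid} wdt:P17 ?country . }}
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    """
    response = requests.get(WIKIDATA_SPARQL,
                            params={"query": query, "format": "json"})
    response.raise_for_status()
    return response.json()["results"]["bindings"]

# Usage (with a placeholder QID):
# owner_and_country("Q123456")
```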
newspaper_info.head()
Now that we have the data, let's process it
Since, for each sample, we have both the quote and the quotee (i.e. the person quoted), it seems natural to try to cluster journals according to these two features and some measure of their similarity. Let us then try two different methods:
- First, we will cluster newspapers according to which speakers are cited in it.
- Secondly, we will cluster newspapers according to the text content of their quotes.
For both of these features, we build the associated TF-IDF matrix, namely a [newspaper x speaker] matrix and a [newspaper x token] one. In a nutshell, this method measures the importance of a term (here a citee, or a word) to a document (here a media) within a collection (here, all the media we have kept), based on its number of occurrences. To do so, it gives each term in a document a weight proportional to its number of occurrences in that document (TF: Term Frequency) and inversely proportional to its 'presence' in the corpus (IDF: Inverse Document Frequency). To quantify the 'presence' of a term in the corpus, we simply count the number of documents in which it occurs –we don't take its per-document occurrences into account here.
TF gives importance to terms that appear often in a document, as they may be characteristic of it. For example, the word 'brain' may often appear in a neuroscience paper, but not so much in a geography one. On the other side, IDF favours terms unique to each document and downweights recurrent terms that appear in basically every document and therefore bring no information (e.g. stopwords such as 'the' or 'a').
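To make this concrete, here is a minimal sketch of how these two matrices could be built with scikit-learn. It assumes `data` already carries a `newspaper` column (derived from the URLs) and a `speaker_qid` column holding one whitespace-free token per speaker; both are simplifications of our actual pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One "document" per newspaper: concatenate everything attributed to it.
per_paper = data.groupby("newspaper").agg(
    all_quotes=("quotation", " ".join),
    all_speakers=("speaker_qid", " ".join),
)

# [newspaper x token] matrix over the quote text, English stopwords removed.
word_vectorizer = TfidfVectorizer(stop_words="english")
tfidf_words = word_vectorizer.fit_transform(per_paper["all_quotes"])

# [newspaper x speaker] matrix: each speaker token is treated as a single term.
speaker_vectorizer = TfidfVectorizer(token_pattern=r"\S+", lowercase=False)
tfidf_speakers = speaker_vectorizer.fit_transform(per_paper["all_speakers"])
```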
These matrices give us a representation of the newspapers in the word and speaker spaces. The problem is that this data (our TF-IDF matrices) is inherently high-dimensional (speakers: ~210'000 dimensions, words: ~260'000 dimensions): we humans cannot visualize it directly, and even computers struggle to run clustering algorithms in that many dimensions.
How can we analyze data which lies in such a high dimension?
In order to tackle this challenge, we project the newspapers into a lower-dimensional space using Singular Value Decomposition (SVD).
This transformation should allow us to extract topics, which correspond to the directions of highest variation, i.e. the ones separating the media the best, together with their importance. We can then view the differences between media along one such axis. Refer to our notebook for more technical details.
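A minimal sketch of this projection with scikit-learn's TruncatedSVD, which works directly on sparse TF-IDF matrices (the number of components kept here is an arbitrary choice for illustration):

```python
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=10, random_state=0)
projected = svd.fit_transform(tfidf_speakers)   # shape: (n_newspapers, 10)

# Rough measure of how much of the variation each axis captures.
print(svd.explained_variance_ratio_.round(3))
```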
Example from a very similar dimensionality reduction technique, PCA: we keep the axes where the variance is maximal, because that is where most of the information lies.
Let us apply this technique to the speakers matrix first
Each of the following plots represents a projection of the media (as represented in the speaker TF-IDF matrix) along two of the axes mentioned above: the x-axis corresponds to the first axis mentioned in the legend, and the y-axis to the second.
Each point corresponds to a media, and its color indicates the group owning it.
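The interactive versions in our notebook use Bokeh; as a minimal static sketch, the same picture can be drawn with matplotlib, assuming `owner` is a Series mapping each newspaper in `per_paper` to its group:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Two SVD axes chosen for illustration; one colour per owning group.
plot_df = pd.DataFrame(projected[:, :2], columns=["axis_1", "axis_2"],
                       index=per_paper.index)
plot_df["group"] = owner.reindex(per_paper.index).values

fig, ax = plt.subplots()
for group, sub in plot_df.groupby("group"):
    ax.scatter(sub["axis_1"], sub["axis_2"], s=10, label=group)
ax.set_xlabel("SVD axis 1")
ax.set_ylabel("SVD axis 2")
ax.legend(fontsize="x-small", ncol=2)
plt.show()
```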
By applying SVD to the speaker TF-IDF matrix and looking at the speakers contributing the most to each axis, we identify the following trends for each axis:
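How we look for those contributors, in a nutshell: each SVD axis assigns a weight to every speaker, and the speakers with the largest absolute weights are the ones driving that axis. A sketch, reusing the `svd` and `speaker_vectorizer` objects from the sketches above:

```python
import numpy as np

speaker_names = speaker_vectorizer.get_feature_names_out()
# (older scikit-learn versions expose this as get_feature_names())

for axis, loadings in enumerate(svd.components_):
    top = np.argsort(np.abs(loadings))[::-1][:5]
    print(f"axis {axis + 1}:", [speaker_names[i] for i in top])
```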
Conclusion
From all these projections we have found 3 types of media:
- Media belonging to some groups –e.g. Townsquare Media– are clustered together on some axes x and y, and are therefore easily identifiable.
- There are also media belonging to groups that we cannot identify as uniquely, for example because they are part of a cluster containing only a few colours (e.g. two or three groups). Still, these few groups are clustered together, so we can exhibit a trend for them.
- And finally, there are a number of groups which we have not managed to visually separate through these 2D projections: their media are split all over, or lie in the middle of some big blob (e.g. Hearst Communications; refer to the Bokeh plots to see this well).
The fact that a process as simple as ours is able to exhibit the media of types 1 and 2 means that these media have an obvious bias correlated with the group they belong to. This bias may be speaker-wise, where some people are put forward more often than they otherwise would be, or word-wise, where some subjects are brought up more than they should be, or some opinions supported excessively.
In any case, if an unaware reader gets their information from media in these groups, whether one or several, they are bound to be biased too.
Note: we deem our process simple because:
- It uses very little information, namely only:
  - the names of the people quoted in a media's articles,
  - and the words used in all the quotes of a media, seen as an unordered set.
  This information is furthermore easily accessible: you can collect it yourself, or use a preexisting dataset such as the one we used, Quotebank.
- The algorithms on which it relies are conceptually relatively simple:
  - the TF-IDF transformation, which we explained earlier,
  - and the Singular Value Decomposition.
Potential caveats and overall outlook
Some pitfalls that may have occurred:
While our analysis made some biases clear, it is deeply linked to the dataset we used –here, Quotebank. That is why it is important to know how the latter was generated. We advise using data that has been reviewed either by a neutral authority or by the community: this is paramount, for, as the saying goes, "garbage in, garbage out".
Future work:
A first step would be to improve the visualizations: for now we limit ourselves to projections along the axes found by SVD, and therefore only illustrate separations along a few axes. We could instead try to find some representation taking into account the distances between media along all their axes.
Moreover, we could carry out a more semantic analysis of the quotes (better than a simple bag/set of words), e.g. using advanced techniques such as deep learning for NLP. This would effectively procure deeper information, but as a tradeoff would be computationally heavier.