The Corpus of British Fiction (CBF)

The CBF is an opportunistically compiled linguistic resource.

The CBF consists entirely of novels and a few short story collections published between 1900 and 2019 by writers born and/or educated in the United Kingdom, i.e. England, Scotland, Wales or Northern Ireland. Where a writer's Britishness was in doubt, Wikipedia was consulted. If a writer is said to have been born in the UK and/or to have spent most of their formative years there, texts produced by that writer have been included.

As far as possible, only adult fiction has been included. Science fiction and fantasy have also been left out. Ultimately, the compiler of a corpus of mainly 20th-century fiction depends on the tastes and preferences of the many volunteers who have taken the time and effort to scan, proofread and make books available to the public at large. The CBF would not have materialised without these people, and we owe them many thanks. Their names, if available, have been included in the header accompanying each text in the corpus.

At the time of writing (August 2023), the corpus consists of approx. 1,470 texts and more than 120 million words, covering eight genres (adventure, crime, general, historical, humour, romance, spy, war) and more than 500 authors, of whom approx. one third are female. Note that the corpus is not balanced, in the sense that the number of books and words is not equal across decades. Nor can it be said to be representative of all (sub-)genres of fiction, as it contains many more crime novels and novels labelled general fiction than, e.g., romance or adventure novels.

A substantial number of the texts have been downloaded from Project Gutenberg (https://www.gutenberg.org/) and Fadedpage (https://www.fadedpage.com/), which means that the spelling is not consistently British English. Anyone interested in words or expressions whose spelling differs between US and UK English must therefore include both alternatives when exploring the corpus. Moreover, it is unclear to what extent certain publishers, e.g. in the US, changed the text or spelling to match in-house styles when publishing the American edition.
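
As an illustration of including both spelling alternatives, the following is a minimal Python sketch, not part of the CBF or of any search interface it may offer. The word pairs, the function names and the sample sentence are hypothetical, and the sketch assumes the text is available as a plain string. It matches whole words only, so inflected forms such as "colours" would need their own pattern.

```python
import re

# Hypothetical UK/US spelling pairs; extend as needed for the query at hand.
SPELLING_PAIRS = [("colour", "color"), ("grey", "gray"), ("realise", "realize")]


def variant_pattern(uk, us):
    """Compile a case-insensitive, whole-word pattern matching either spelling."""
    alternatives = "|".join(re.escape(form) for form in (uk, us))
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def count_variants(text, uk, us):
    """Count occurrences of each spelling in text, keyed by lower-cased form."""
    counts = {uk: 0, us: 0}
    for match in variant_pattern(uk, us).finditer(text):
        counts[match.group(0).lower()] += 1
    return counts


# Invented sample sentence, not taken from the corpus.
sample = "The colour had faded to gray. No color remained, only grey."
for uk, us in SPELLING_PAIRS:
    print(uk, us, count_variants(sample, uk, us))
```

Counting the two forms separately, rather than merging them, also gives a rough impression of how mixed the spelling is in a given text.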

Free, public access to the corpus is limited due to restrictions on distribution and copyright.

Published July 25, 2023 12:29 PM - Last modified July 25, 2023 12:29 PM