Recent discussions online have gotten me interested in producing a complete lexicon of my fiction. I’d like to find a fan with the scripting chops to run about four or five hundred .doc and .docx files (plus a few .pdf, .html and .txt files) through a script and pull out a list of every distinct word I have ever used in my fiction, ideally with frequency counts.
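For anyone weighing the job, the counting step might look roughly like the sketch below. It assumes the text has already been extracted from the various file formats into plain strings (that extraction is the harder part and is not shown), and the names `WORD_RE` and `count_words` are made up for illustration.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters, optionally with internal
# apostrophes (straight or curly), so "don't" stays one word.
WORD_RE = re.compile(r"[a-z]+(?:['’][a-z]+)*")

def count_words(texts):
    """Return a Counter mapping each lowercased word to its frequency."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return counts

# Tiny stand-in for the real corpus.
sample = ["She walked home. She was walking slowly.", "Don't walk alone."]
counts = count_words(sample)

# Print the most frequent words first.
for word, n in counts.most_common(3):
    print(word, n)
```

Lowercasing everything keeps "She" and "she" together, at the cost of merging proper nouns with ordinary words. Non-ASCII letters are ignored by this pattern, so accented or invented names would need a broader definition of a word.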
There’s a second part to this, which needs someone with the linguistics chops to filter that list for words that are forms of the same stem, e.g. “walk”, “walked” and “walking”.
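To give a sense of what that filtering involves, here is a deliberately crude sketch that groups words by stripping a few common English suffixes. It is only an illustration: the suffix list and the names `crude_stem` and `group_by_stem` are assumptions, and it will mishandle irregular forms ("ran", "went") and doubled consonants ("running"), which is exactly where real linguistic knowledge or an established stemmer or lemmatizer would be needed.

```python
from collections import defaultdict

# Assumption: a tiny, hand-picked suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def group_by_stem(counts):
    """Map each crude stem to the word forms (and counts) that share it."""
    groups = defaultdict(dict)
    for word, n in counts.items():
        groups[crude_stem(word)][word] = n
    return dict(groups)

# Made-up frequencies, purely for demonstration.
counts = {"walk": 4, "walked": 2, "walking": 1, "sing": 3}
groups = group_by_stem(counts)

for stem, forms in groups.items():
    print(stem, sum(forms.values()), sorted(forms))
```

The three-letter minimum keeps short words like "sing" from being cut down to "s", but rules of this kind only go so far. Coined or re-invented words would likely need a human eye in any case.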
I’m curious how large my demonstrated written vocabulary is, and secondarily how many words I’ve coined, re-invented or back-formed.
Anybody interested in grinding this for me?