[writing|tech] Looking for a reader with some scripting chops

Recent discussions online have gotten me interested in producing a complete lexicon of my own fiction output. I’d like to find a fan with the scripting chops to feed about four or five hundred .doc and .docx files (plus a few .pdf, .html and .txt files) through a scripting engine and pull out a list of each distinct word I have ever used in my fiction output, ideally with frequency.
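For anyone picking this up, here is a minimal sketch of the counting step. It assumes the .doc, .docx, .pdf and .html files have already been converted to plain .txt by some other tool, and the names `count_words` and `count_folder` are made up for illustration. The word pattern is also an assumption: it keeps internal apostrophes ("don't") but ignores hyphens, digits and accented letters.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: a "word" is a run of ASCII letters, optionally joined by
# straight or curly apostrophes. Hyphenated and accented words would need
# a broader pattern.
WORD_RE = re.compile(r"[a-z]+(?:['’][a-z]+)*")


def count_words(text):
    """Return a Counter mapping each lowercased word to its frequency."""
    return Counter(WORD_RE.findall(text.lower()))


def count_folder(folder):
    """Sum word counts over every .txt file under `folder` (hypothetical layout)."""
    totals = Counter()
    for path in Path(folder).rglob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="replace")
        totals.update(count_words(text))
    return totals
```

Calling `count_folder("fiction_txt").most_common(50)` on a folder of converted files (the folder name is a placeholder) would list the fifty most frequent words with their counts.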

There’s a second part to this, which needs someone with the linguistics chops to filter that list for words which are forms of the same stem, e.g. “walk”, “walked” and “walking”.
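The grouping step might be roughed out like this. It is only a sketch: `crude_stem` is a deliberately naive suffix stripper invented for this example, not a real stemming algorithm, and it will mangle plenty of words ("glass" becomes "glas"). A proper pass would swap in an established stemmer or lemmatizer and still need a human eye on coined words.

```python
from collections import defaultdict

# Assumption: only a handful of common English suffixes, checked longest first.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three letters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_by_stem(counts):
    """Group a {word: frequency} mapping under each word's crude stem."""
    groups = defaultdict(dict)
    for word, n in counts.items():
        groups[crude_stem(word)][word] = n
    return dict(groups)
```

For example, `group_by_stem({"walk": 3, "walked": 2, "walking": 1})` puts all three forms under the single key `"walk"`.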

I’m curious what my demonstrated written vocabulary is, and secondarily how many words I’ve coined, re-invented or backformed.

Anybody interested in grinding this for me?

  1. KittenHerder says:

    I would love to do this for you, but you can also do it for yourself (or perhaps one of your tribe there could assist). I think this program would do everything you want and more: http://www.lexically.net/wordsmith/step_by_step_English6/index.html If you do not feel that you or one of your local tribe can handle this, let me know and I will gladly take this on. The software looks WONDERFUL, and since I have a strong fascination with machine language processing I would love an excuse to play with this. – Robin

    1. Jay says:

      I’m kind of overloaded being a terminal cancer patient, which is why I’m looking for help with this in the first place. If it goes on my own to-do list, or that of the people immediately round me, it will *never* happen. But thank you for the tool reference.

      1. KittenHerder says:

        I totally get that. Again, I would gladly take on the project for you, both to help you explore the lexical depths of your works and as a project to familiarize myself with a tool that covers an area that I have been interested in for ages.

  2. j.r. murdock says:

    Hey Jay. My full-time job is web design, but I do a lot of db and script work. I’d love to jump in and help as well. I could build a simple script to digest your documents and create a simple database of words and counts. Sounds like fun, actually.

