You could do it with a program called TextCrawler. You'll need a regular expression (regex) for it, so you might have to post on their forum to find out exactly how.
Instead of going through that hassle, you could just use a custom footprint before harvesting that only finds .gov or .edu sites, so you won't have to sift through the results and filter them out afterwards.
For example: site:.edu "powered by wordpress" "leave a reply"
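If you already have a harvested list and would rather filter it than re-harvest, the regex route could look roughly like the sketch below. This is plain Python, not TextCrawler syntax, and the names (EDU_GOV_HOST, keep_edu_gov) are made up for illustration. It assumes one URL per entry and treats hosts ending in .edu or .gov, optionally followed by a two-letter country code (like .edu.au), as matches.

```python
import re
from urllib.parse import urlparse

# Assumption: a "match" is a hostname ending in .edu or .gov,
# optionally followed by a two-letter country code (e.g. .edu.au, .gov.uk).
EDU_GOV_HOST = re.compile(r"\.(?:edu|gov)(?:\.[a-z]{2})?$", re.IGNORECASE)


def keep_edu_gov(urls):
    """Return only the URLs whose hostname looks like a .edu or .gov domain."""
    kept = []
    for raw in urls:
        url = raw.strip()
        if not url:
            continue
        # urlparse only fills in the hostname when a scheme is present,
        # so add a dummy one for bare entries like "example.edu/page".
        to_parse = url if "://" in url else "http://" + url
        host = urlparse(to_parse).hostname or ""
        if EDU_GOV_HOST.search(host):
            kept.append(url)
    return kept


if __name__ == "__main__":
    sample = [
        "http://blog.example.edu/post",
        "https://example.com/edu",
        "http://www.example.gov/page",
    ]
    for url in keep_edu_gov(sample):
        print(url)
```

Checking the hostname rather than the whole URL keeps out things like example.com/edu, where "edu" only shows up in the path.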