Update: I already got the cleaned file, thanks to several of my readers! Appreciate the help!
OK, I downloaded the latest changes.xml file from weblogs.com. If you don’t know what weblogs.com is, it’s a service that most weblog tools “ping” to let it know that someone has just published.
In the early days of blogging Dave Winer and other bloggers would watch this page like a hawk since it would display when new people had just posted. Remember, when I started blogging there were only a couple of hundred bloggers with only a few dozen posts a day. You could read this page just like many of us read TechMeme or TailRank now.
Anyway, I just downloaded the last hour and there were more than 60,000 entries in that file. Whew! OK, I used brute force and cleaned up just the “As.” Brute force means I just went through and deleted them by hand, without using any macros or scripts.
It’s taking too long to do it by hand (60,000 URLs is too many) and, anyway, it’d be fun to redo this test over and over to see whether the number of blogs from each service changes depending on the day of the week and the time of day.
Anyway, here’s what I need done. This is a perfect job for Amazon’s Mechanical Turk. That service lets you spec out a small job and have someone with a little extra time do it for you for a reasonable fee.
On the other hand, I’ll also ask here. Here’s what I need:
1) Take my Excel .XLS file (I’ll clean it up and put it into a column for you) and delete all the URLs that don’t come from one of these services: blogspot.com, wordpress.com, livejournal.com, spaces.live.com, or typepad.com.
That’s it. Easy, huh? Should take one of the programmer types here a few minutes to write an Excel macro to do that. If you’d rather I just hand you a comma-delimited text file, I can do that too. Or, you can just go get the file yourself from weblogs.com (it’s an XML file) and clean it up yourself. I just need the URLs; I don’t care about anything else.
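For the programmer types, here is roughly what that job could look like as a small Python script instead of an Excel macro. Treat it as a sketch, not the definitive way to do it. It assumes the weblogs.com XML file lists each blog as a `weblog` element with a `url` attribute, and the function names (`service_for`, `urls_from_changes_xml`, `filter_and_count`) are made up for illustration. It keeps only URLs whose host is one of the five services (or a subdomain of one) and counts how many came from each.

```python
from collections import Counter
from urllib.parse import urlparse
import xml.etree.ElementTree as ET

# The five services named in the request above.
SERVICES = (
    "blogspot.com",
    "wordpress.com",
    "livejournal.com",
    "spaces.live.com",
    "typepad.com",
)


def service_for(url):
    """Return the service a URL belongs to, or None if it matches none."""
    # Tolerate entries that were saved without "http://".
    if "://" not in url:
        url = "http://" + url
    try:
        host = (urlparse(url).hostname or "").lower()
    except ValueError:
        # Malformed URL; treat it as not matching any service.
        return None
    for service in SERVICES:
        # Match the bare domain or any subdomain, but not look-alikes
        # such as "fakeblogspot.com".
        if host == service or host.endswith("." + service):
            return service
    return None


def urls_from_changes_xml(xml_text):
    """Pull the URLs out of the XML (assumed: <weblog url="..."/> elements)."""
    root = ET.fromstring(xml_text)
    return [w.get("url") for w in root.iter("weblog") if w.get("url")]


def filter_and_count(urls):
    """Return (URLs from the five services, Counter of URLs per service)."""
    kept = []
    counts = Counter()
    for url in urls:
        service = service_for(url)
        if service is not None:
            kept.append(url)
            counts[service] += 1
    return kept, counts
```

To use it, you would read the downloaded file into a string (say, `open("changes.xml", encoding="utf-8").read()`, where the local file name is just an example), pass it to `urls_from_changes_xml`, and hand the result to `filter_and_count`. If you start from a plain text file with one URL per line instead, you can skip the XML step and feed the lines straight to `filter_and_count`. Because the counts come back per service, rerunning it on files grabbed at different times of day makes the comparison easy.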