I need Excel help, off to Amazon’s Mechanical Turk

Update: I already got the cleaned file, thanks to several of my readers! Appreciate the help!

OK, I downloaded the latest change.xml file from weblogs.com. If you don’t know what weblogs.com is, this is a service that most weblog tools will “ping,” or let know that someone has just published.

In the early days of blogging Dave Winer and other bloggers would watch this page like a hawk since it would display when new people had just posted. Remember, when I started blogging there were only a couple of hundred bloggers with only a few dozen posts a day. You could read this page just like many of us read TechMeme or TailRank now.

Anyway, I just downloaded the last hour and there were more than 60,000 entries in that file. Whew! OK, I went through brute force and cleaned up just the “As.” Brute force means I just went through and deleted them by hand, not using any macro or scripts.

It’s taking too long to do it by hand (60,000 URLs is too many) and, anyway, it’d be fun to redo this test over and over to see if the numbers of blogs done from each service change depending on the day of week and time of day.

Anyway, here’s what I need done. This is a perfect job for Amazon’s Mechanical Turk. That service lets you spec out a small job, and get someone who has a little extra time to do it to do it for you for a reasonable fee.

On the other hand, I’ll also ask here. Here’s what I need:

1) Take my Excel .XLS file (I’ll clean it up and put it into a column for you) and delete all the URLs that don’t come from blogspot.com; wordpress.com; livejournal.com; spaces.live.com; typepad.com.

That’s it. Easy, huh? Should take one of the programmer types here a few minutes to write an Excel macro to do that. If you’d rather me just hand you a comma-delimited text file, I can do that too. Or, you can just go get the file yourself from weblogs.com (it’s an XML file) and clean it up yourself. I just need the URLs, I don’t care about anything else.

Comments

  1. I just tried to download the xml file, it appears to have a wrong xml format (A string literal was not closed in one of the records), maybe will try later…

  2. I just tried to download the xml file, it appears to have a wrong xml format (A string literal was not closed in one of the records), maybe will try later…

  3. Here is a powershell approach.

    You can paste this into a PS commandline window and you should end up with a scrubbed.csv file ready for Excel.

    If you want, let me know where I can post the file directly.

    $wc = New-Object Net.Webclient
    $wc.DownloadString(“http://rpc.weblogs.com/changes.xml”) > changes.xml

    $data = [xml](get-content changes.xml)

    $data.weblogUpdates.weblog | where {$_.url -match “blogspot.com|wordpress.com|livejournal.com|spaces.live.com|typepad.com”} | Export-Csv -NoTypeInformation scrubbed.csv

  4. Here is a powershell approach.

    You can paste this into a PS commandline window and you should end up with a scrubbed.csv file ready for Excel.

    If you want, let me know where I can post the file directly.

    $wc = New-Object Net.Webclient
    $wc.DownloadString(“http://rpc.weblogs.com/changes.xml”) > changes.xml

    $data = [xml](get-content changes.xml)

    $data.weblogUpdates.weblog | where {$_.url -match “blogspot.com|wordpress.com|livejournal.com|spaces.live.com|typepad.com”} | Export-Csv -NoTypeInformation scrubbed.csv

  5. Robert: You came from Microsoft. You should know a way or two to do this in Excel without any macros or programmer help. I’d take a stab at it, but I left my Windows laptop at work. Maybe tomorrow if nobody else has taken a shot.

  6. Robert: You came from Microsoft. You should know a way or two to do this in Excel without any macros or programmer help. I’d take a stab at it, but I left my Windows laptop at work. Maybe tomorrow if nobody else has taken a shot.

  7. I make it

    147675 in total

    9726 blogspot
    698 wordpress
    522 livejournal
    540 spaces.live
    238 typepad

    for the last changes.xml.

    Typepad users can map different URLs to their typepad weblog, so that may skew things.

    And I may have made gross errors, at 3am. :)

  8. I make it

    147675 in total

    9726 blogspot
    698 wordpress
    522 livejournal
    540 spaces.live
    238 typepad

    for the last changes.xml.

    Typepad users can map different URLs to their typepad weblog, so that may skew things.

    And I may have made gross errors, at 3am. :)

  9. Robert, try using an Advanced Filter in Excel. Enter this as the criterion range:

    url
    “*wordpress*”
    “*blogspot*”
    “*livejournal*”
    “*spaces.live*”
    “*typepad*”

  10. Robert, try using an Advanced Filter in Excel. Enter this as the criterion range:

    url
    “*wordpress*”
    “*blogspot*”
    “*livejournal*”
    “*spaces.live*”
    “*typepad*”

  11. Export to a text file, and from a linux command line:
    grep “blogspot.com|wordpress.com|livejournal.com|spaces.live.com|typepad.com” oldxlsfile.txt > newxlsfile.txt

  12. Export to a text file, and from a linux command line:
    grep “blogspot.com|wordpress.com|livejournal.com|spaces.live.com|typepad.com” oldxlsfile.txt > newxlsfile.txt

  13. Richard: just because someone worked at Microsoft does that mean they know how to write a kernel-level driver? So, what does that have to do with anything?

    Thanks everyone for helping out! Got several versions of the cleaned file now.

  14. Richard: just because someone worked at Microsoft does that mean they know how to write a kernel-level driver? So, what does that have to do with anything?

    Thanks everyone for helping out! Got several versions of the cleaned file now.

  15. Hi Robert,

    If you have grep handy, export a csv, and run

    grep -v ‘blogspot.com|wordpress.com|livejournal.com|spaces.live.com|typepad.com’ file.csv

    If you dont, email me the file, and I’ll do it.

  16. Hi Robert,

    If you have grep handy, export a csv, and run

    grep -v ‘blogspot.com|wordpress.com|livejournal.com|spaces.live.com|typepad.com’ file.csv

    If you dont, email me the file, and I’ll do it.

  17. I’ll organize the excel file for you tonight. I’m starting up a home business as a virtual assitant and this is the kind of thing that I would do for my clients.

  18. I’ll organize the excel file for you tonight. I’m starting up a home business as a virtual assitant and this is the kind of thing that I would do for my clients.

  19. I think a lot of use are looking at different time windows. I get
    Windows Live Spaces 795
    WordPress 363
    LiveJournal 245
    TypePad 114
    Blogspot 8844
    Comparing those numbers with other numbers here suggests wide variation across time. One hour is not going to be a very useful snapshot. Well except for showing huge numbers for Blogspot.com.
    One thing I did not filter for but that I assume you will is duplicates. That is to say a brief scan of my output suggests that a lot of people are posting several times in the same hour window. Most of the ones I saw were in Blogspot.com but who knows how many there are.

  20. I think a lot of use are looking at different time windows. I get
    Windows Live Spaces 795
    WordPress 363
    LiveJournal 245
    TypePad 114
    Blogspot 8844
    Comparing those numbers with other numbers here suggests wide variation across time. One hour is not going to be a very useful snapshot. Well except for showing huge numbers for Blogspot.com.
    One thing I did not filter for but that I assume you will is duplicates. That is to say a brief scan of my output suggests that a lot of people are posting several times in the same hour window. Most of the ones I saw were in Blogspot.com but who knows how many there are.

  21. I was able to do it using Zoho Creator. Please have a look at http://zohocreator.com/sugan/weblogs-tracker/

    You can see
    * a view that displays the data from weblogs.com that matches your criteria
    * form to import the xls to Zoho Creator (using the “Import Data” link at the top of the form)

    I have written a simple two line code to check if the url contains blogspot.com, wordpress.com, livejournal.com, spaces.live.com or typepad.com. Only those entries that contain this in the url will get persisted in Zoho Creator.

    You can view the script at http://zohocreator.com/sugan/weblogs-tracker/73096/viewscript/

    I had taken the xml from weblogs.com (Only the changes made in a span of 5 minutes). The xls that I imported can be found at http://zohocreator.com/sample/xls/weblogs_scobelizer.xls. It has 14,641 entries. But only 796 entries matched your criteria.

    Thanks,
    Suganya.

  22. I was able to do it using Zoho Creator. Please have a look at http://zohocreator.com/sugan/weblogs-tracker/

    You can see
    * a view that displays the data from weblogs.com that matches your criteria
    * form to import the xls to Zoho Creator (using the “Import Data” link at the top of the form)

    I have written a simple two line code to check if the url contains blogspot.com, wordpress.com, livejournal.com, spaces.live.com or typepad.com. Only those entries that contain this in the url will get persisted in Zoho Creator.

    You can view the script at http://zohocreator.com/sugan/weblogs-tracker/73096/viewscript/

    I had taken the xml from weblogs.com (Only the changes made in a span of 5 minutes). The xls that I imported can be found at http://zohocreator.com/sample/xls/weblogs_scobelizer.xls. It has 14,641 entries. But only 796 entries matched your criteria.

    Thanks,
    Suganya.

  23. I had no idea that blogspot was so dominant … I’d really like to see these numbers a oouple of times a day for a week or so. Do you think that weblogs.com might have these statistics readily available or will we have to come up with some sort of script to pull this a couple times a day for a week?

  24. I had no idea that blogspot was so dominant … I’d really like to see these numbers a oouple of times a day for a week or so. Do you think that weblogs.com might have these statistics readily available or will we have to come up with some sort of script to pull this a couple times a day for a week?

  25. wow – I would have used access and attach the worksheet as a table. then use the query designer. second step would be to create a report, and may be a chart for clarity…

    just a thought from up north

  26. wow – I would have used access and attach the worksheet as a table. then use the query designer. second step would be to create a report, and may be a chart for clarity…

    just a thought from up north

  27. I just wrote a small script which will read the file and write the requested urls into a separate xml file. just copy the following code into a notepad and save it in the same folder as changes.xml. Just ignore the tags. I could not mail it to you because my email server rejects any vbscript present in the mail.


    dim fs, ts, ts1
    set fs=CreateObject("Scripting.FileSystemObject")
    dim s
    set ts=fs.opentextfile("changes.xml")
    set ts1=fs.createtextfile("changes1.xml")
    do while ts.atendofstream=false
    s=ts.readline
    if len(trim(s))>0 then
    if instr(1,s,"url") >0 then
    if instr(1,s,"blogspot.com")>0 then
    ts1.writeline s
    elseif instr(1,s,"wordpress.com")>0 then
    ts1.writeline s
    elseif instr(1,s,"livejournal.com")>0 then
    ts1.writeline s
    elseif instr(1,s,"spaces.live.com")>0 then
    ts1.writeline s
    elseif instr(1,s,"typepad.com")>0 then
    ts1.writeline s
    end if
    else
    ts1.writeline s
    end if
    end if
    loop

  28. I just wrote a small script which will read the file and write the requested urls into a separate xml file. just copy the following code into a notepad and save it in the same folder as changes.xml. Just ignore the tags. I could not mail it to you because my email server rejects any vbscript present in the mail.


    dim fs, ts, ts1
    set fs=CreateObject("Scripting.FileSystemObject")
    dim s
    set ts=fs.opentextfile("changes.xml")
    set ts1=fs.createtextfile("changes1.xml")
    do while ts.atendofstream=false
    s=ts.readline
    if len(trim(s))>0 then
    if instr(1,s,"url") >0 then
    if instr(1,s,"blogspot.com")>0 then
    ts1.writeline s
    elseif instr(1,s,"wordpress.com")>0 then
    ts1.writeline s
    elseif instr(1,s,"livejournal.com")>0 then
    ts1.writeline s
    elseif instr(1,s,"spaces.live.com")>0 then
    ts1.writeline s
    elseif instr(1,s,"typepad.com")>0 then
    ts1.writeline s
    end if
    else
    ts1.writeline s
    end if
    end if
    loop

  29. Yeah, wget grep / egrep and alike would do the trick most neatly – if you really would like to go that way, I can give you the stand alone WINDOWS executables.

    This is how I earned my living in controlling the last 10 years: using awk, grep and co to prepare data for excel.

    As for pure Excel I would agree on use ‘count if’ formula, you would always have your list which can be updated and the system would use “count if” with all the blogservices you would like to look at.

  30. Yeah, wget grep / egrep and alike would do the trick most neatly – if you really would like to go that way, I can give you the stand alone WINDOWS executables.

    This is how I earned my living in controlling the last 10 years: using awk, grep and co to prepare data for excel.

    As for pure Excel I would agree on use ‘count if’ formula, you would always have your list which can be updated and the system would use “count if” with all the blogservices you would like to look at.