Microsoft Word generates clean HTML for blogs?

Awesome. One Microsoft team heard my pleas for clean XHTML. By the way, the new Word 2007 has the ability to post to blogs built in. Joe Friend has been writing about it.

Lots of Microsoft program managers push back and say "normal people don't care about HTML quality."

That might be true (although we all hate it when our pages don't display right on all browsers, or when they are slow to load) but the influentials who write reviews and tell their friends (or set up their computers) do care about such things.

One guy told me "but we have 10s of millions of users already, so why should we care?"

Well, you woulda had 10,000,001 if you had clean HTML. :-)

  • Keith Patrick

    I am always amazed at how unreadable MS-generated HTML is. If someone can make sense of the indenting logic from ASP.Net, for instance, let me know.

  • http://www.HubSpot.com Dharmesh Shah

    Agreed. I use an email-to-blog feature and hence get HTML generated through Outlook (which is basically Word).

    Would be nice to have this be clean HTML. Looking forward to trying Office 2007 and see if it improves the situation.

  • http://winsmarts.com/ Sahil Malik

    But then there is a big disadvantage of XHTML.

    Now people can write tools to steal my content, whereas earlier they would have had to copy paste.

  • http://scobleizer.wordpress.com/ Robert Scoble

    Sahil, if you are worried about that then I wouldn’t put content on the Internet. As long as a browser can display your content it can be stolen. There’s lots of screen scraping tools and they work just fine with cruddy HTML. Yeah, XHTML makes the job a little easier, but if you are worried about that I wouldn’t publish on the Web.

  • http://crueltobekind.org/ Nicole Simon

    I would use Word for blogging – because I can work fast with it, it has a decent macro recorder, there are keyboard shortcuts for all the goodies (and I expect alt+1-3 to get H1-H3)

    I might even use it for blogging directly to some of the given hosts or otherwise I would try to get the HTML code out of it.

    Make that thing nice, and I can stop bitching about stupid blog editors which do not get the basics about windows software UI.

  • http://superrob.blogspot.com/ Rob Stevens

    “Lots of Microsoft program managers push back and say ‘normal people don’t care about HTML quality.’”

    This mentality totally escapes me. Just because the sheep don’t understand why something is important, that doesn’t make it any less important! That’s a lazy answer, and if you think people who don’t implement RSS on their websites should be fired, then people with this mentality should be double-fired. :)

    There’s a lot of things that our country has had to deal with because no one has stepped up to challenge the status quo. But boy, when that person steps up, it changes everything. Microsoft needs to be promoting managers who are willing to say that even though this is harder and might cost more, it’s the right thing to do for our customers (whether the customers know it or not).

  • Pingback: Black Bag Operations Network » BadPage.info has a new purpose?

  • http://www.zoundry.com/ Eric

    Yes, the HTML generated by the various Microsoft products is pretty awful. It is particularly irksome with respect to blogging, since many blog APIs would like you to give them XHTML. Our application does its best, by providing an XHTML source validator in the editor, as well as attempting to clean up any HTML it finds (and turn it into XHTML). Still, there is only so much Tidy + our own cleanup can do with really terrible HTML.

  • Uses the Source

    Cool! Maybe next we can get URLs that don’t suck, so we can mash them up?

  • Timothy McClanahan

    Okay, that MS guy who asked why should they care? THAT guy should be FIRED. And all the other MS people like him. Sheesh. And MS wonders why web developers hate them? Get a clue.

  • Mike

    I don’t know what you mean by clean Html. Let’s assume someone uses one of the new fonts coming along with Word 2007. Those fonts are unknown to most others. So you get crap.

  • http://zoundry.com/ Pidge

    Mike,
    The MS team has taken a step in the right direction (XHTML markup). Let’s hope the MS Word developers take the extra step to include alternate font families (including generics) in the generated CSS so that the XHTML presentation degrades gracefully in cases where new fonts are used.

    And more importantly – for the benefit of application developers (like us), MS Word team should extend the current work to support xhtml markup in clipboard operations (copy/paste, drag and drop etc).
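
The font fallback idea in the comment above could be sketched roughly as follows. This is an illustration only, not anything Word does: the `FALLBACKS` table and the `with_fallbacks` helper are hypothetical, and the substitute fonts chosen for Calibri, Cambria and Consolas (three of the fonts introduced with Office 2007) are assumptions.

```python
# Hypothetical fallback table: the substitute fonts are illustrative guesses.
FALLBACKS = {
    "calibri": ["Arial", "sans-serif"],
    "cambria": ["Georgia", "serif"],
    "consolas": ["Courier New", "monospace"],
}

# CSS generic family keywords, which should come last in a font-family list.
GENERIC = {"serif", "sans-serif", "monospace", "cursive", "fantasy"}


def with_fallbacks(font_family):
    """Return a CSS font-family value extended with fallback families."""
    names = [n.strip() for n in font_family.split(",") if n.strip()]
    result = list(names)
    seen = {n.strip("'\"").lower() for n in names}
    for name in names:
        for extra in FALLBACKS.get(name.strip("'\"").lower(), []):
            if extra.lower() not in seen:
                result.append(extra)
                seen.add(extra.lower())
    # Stable sort: named fonts keep their order, generic keywords move to the end.
    result.sort(key=lambda n: n.lower() in GENERIC)
    # Quote multi-word names that are not already quoted.
    return ", ".join(
        f'"{n}"' if " " in n and n[0] not in "'\"" else n for n in result
    )


print(with_fallbacks("Calibri"))
```

A browser without Calibri would then fall through to the next family in the list instead of its default font.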

  • http://www.flutterby.com/ Dan Lyke

    Next Monday I’m meeting with folks from a local non-profit who need a better web page, one with dynamic content about their events. My preferred solution goes something like “use this template when you write up your event, and save things into this folder, I’ll whip up some Perl to do the rest”. Alas, I haven’t even thought about broaching that because I’m sure that they’re running Office, and I didn’t want to deal with trying to extract text that eventually becomes HTML from that, mostly based on how bad Word’s HTML has been up to this point, and the fact that I had no good way of extracting text from .DOC files.

    So I’ll most likely be trying to slide “Gee, if you only used the OpenOffice.org suite we could do this the easy way, but instead you’ll have to enter things twice, too bad you use Microsoft software” into the conversation.

    Which adds my voice to the chorus of “fire the guy who said ‘normal people don’t care about HTML quality.’” Your existing customers may not care, but those of us who deliberately try not to be your customers for precisely these sorts of reasons might not be so far from the fold if those project managers had cared a little bit more about software quality.

  • http://www.reeftooutback.com/ Bill Hutchison

    This is great news. A lot of our staff copy and paste content from Word onto the various web-sites that our non-profit organization runs. The code from Word sometimes breaks the pages and also means that we have had to create special “paste from Word” applications on some of our sites.

    Clean HTML from Word will greatly increase the productivity of our writers.

  • http://www.bivingsreport.com/ Todd Zeigler

    I can’t even imagine how much time is wasted worldwide on a daily basis cleaning up dirty Word code. I also can’t imagine how much truly awful HTML has made it onto the web as a result of Word – posted by people who don’t have the knowledge to clean it up or just don’t bother. This goes way beyond bloggers.

    Any step to improve the situation is welcomed.

  • http://tojosan.blogspot.com/ Todd

    Rob,
    Thanks for sharing the inside scoop on the future “cleanness”. It would sure make things easier at work to get working webpages. Since the inclusion of the ability to publish HTML from Word, folks have been able to create any old crap. (No offense, probably their fault as much as anything.) But the big reason is that it hides what it is doing, and they feel what they don’t see doesn’t hurt.

    I’m truly shocked to read the quotes above though, even from Microserfs. For a minute I’m tempted to believe you made those up just to humor ‘those that hate Microsoft’. Do professional software engineers really say those sorts of things and mean it?

    Keep up giving us these insights either way.

  • http://www.darrenbarefoot.com dbarefoot

    Can I get an amen? Thank the Lord for clean HTML.

  • http://www.techwatch.co.uk/ Brian

    I remember when I built my first websites, back in 2000, I used to use Word 97 with the HTML editor as it was simple and easy to use for someone inexperienced in HTML like myself.

    It’s nice to hear that Word can yet have a role in web publishing in the modern web – while maybe there are better tools out there, few have the sheer accessibility of Word.

  • Mat Steeples

    at #7. Am I the only person to find it funny that http://www.badpage.info doesn’t even validate against its own tests? :P

    Mat

  • http://firsttube.com Adam S

    Seriously, though, why would I want to use a word processing program to post to a blog? Any blog software worth its weight in salt has an interface for quickly posting, many of them have WYSIWYG interfaces now. If your blog supports the metaweblog API, which I can safely suppose from the screenshots is a requirement of using Word like this anyway, it almost definitely has customizable posting capabilities already.

    I saw this, and I suppose it’s nice to add a new feature beyond a shell that blends into the Windows environment, but isn’t this just the Word team adding features for the sake of adding features? Were there people clamoring for this?

  • twisterjosh

    Saving the web page as an HTML (Filtered) page usually takes away a lot of the crappy html… still not as clean as I’d like it though, but only a million times better than saving it as you normally would.

  • http://www.caraworks.com/ Scott Fletcher

    Thanks for bringing that up! When blogging/building my show notes for Podcheck, I copy-paste from my Outlook e-mails. I have hacked countless PHP and java scripts on my blog site to strip out those lame “mso:normal blah; blah:lame size:12pt style:css:unecessary:crap-12 mso:makes-me:so-mad” tags. Thank you Scoble!
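
The kind of cleanup script described in the comment above might look something like this minimal sketch. It is not the commenter's code: the function name `strip_word_markup` is hypothetical, and the sketch assumes double-quoted `style` attributes with no HTML entities inside them. Regular expressions are a rough tool for HTML, so treat this as a quick hack rather than a robust sanitizer.

```python
import re

# Office-namespaced tags such as <o:p> and </o:p>.
OFFICE_TAG = re.compile(r"</?o:[^>]*>", re.IGNORECASE)
# Word's class names, e.g. class=MsoNormal or class="MsoNormal".
MSO_CLASS = re.compile(r'\sclass="?Mso\w*"?', re.IGNORECASE)
# A double-quoted style attribute (assumption: no entities inside the value).
STYLE_ATTR = re.compile(r'\sstyle="([^"]*)"', re.IGNORECASE)
# A single "mso-something: value;" declaration inside a style value.
MSO_DECL = re.compile(r"mso-[\w-]+\s*:[^;]*;?\s*", re.IGNORECASE)


def _clean_style(match):
    """Drop mso-* declarations; drop the whole attribute if nothing is left."""
    remaining = MSO_DECL.sub("", match.group(1)).strip(" ;")
    return f' style="{remaining}"' if remaining else ""


def strip_word_markup(html):
    """Remove common Word-specific tags, classes and style declarations."""
    html = OFFICE_TAG.sub("", html)
    html = MSO_CLASS.sub("", html)
    return STYLE_ATTR.sub(_clean_style, html)


sample = '<p class=MsoNormal style="mso-margin-top-alt:auto; color:red">Hi<o:p></o:p></p>'
print(strip_word_markup(sample))
```

Declarations that do not start with `mso-` are left as they were, so ordinary inline styling survives the cleanup.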

  • http://manypies.blogspot.com/ Paul Morriss

    Tell those program managers “your competitors won’t be able to boast that their product can clean up Word HTML”. I’ve got Dreamweaver Ultradev 4 and it has a special function to do just that.

    I’ve had to sort out a web page created using a dynamic edit control where the user had pasted Word text into the WYSIWYG edit box. He cared that it looked a mess by the time it was rendered on a web page.

  • Ben

    Are you sure they aren’t just saying that “normal people don’t care about HTML quality” because they can’t openly tell you the truth – that they’ve been told to generate munged output that only works in IE as an attempt at browser lock-in?

    If that’s not the purpose, it’s certainly an effect.

    I think there’s a certain level of doublethink going on at Microsoft. In the past, you could at least be open internally about anti-competitive moves like integrating IE into the OS (Brad Silverberg’s famous comment about the pressure to do “unnatural and losing things to ‘protect’ Windows” and fear that “trying to win the Internet using Windows is a losing strategy”), but now there’s too much risk of a conversation like that coming to light via subpoena (as Brad’s did).

    There are still so many little details of Windows and MS software that are clearly attempts to achieve lock-in (why is the MP3 ripping support in my Media Player limited to only 56kbps quality, if not to steer me into using WMA?). Someone’s taken the decision to pursue a lock-in strategy there, but how is this stuff discussed? Face-to-face in a sauna to avoid bugging? By exchanging memos on rice paper?

    Not only is lock-in annoying, but in an age of software distribution via HTTP, it’s going to get increasingly ineffective. When I can download iTunes and VLC, why would I stick with crappy WMP? Users might not fully understand why they can’t turn up the MP3 quality slider, but they’ll certainly have learnt that it’s better to try and find an alternative that works properly than try to fix the Microsoft solution.

    There’s a conversation that’s worth having within Microsoft. Lock-in is lame and it doesn’t work, so why don’t you eradicate it? Does Hotmail have free POP access for everyone yet, like Gmail does? Does *every* app have an Export button as well as an Import button (and is the output in an open format like OPML? and is it human readable or hopelessly munged and riddled with CDATA?)

    If MS apps are truly offering the best experience, and not just relying on bundling and lock-in, then they should let users get their data out in useful formats. The Word team has taken a step in the right direction.

  • http://zoundry.com/ Pidge

    Paul,
    Just like Dreamweaver, our blog editor (Zoundry) also has to do “special work” to clean up MS Word generated content. Word’s “Save as Filtered HTML” helps, but we found most users simply copy and paste from MS Word (can’t blame them).

    In our case (in addition to validation), we have to go to great pains to clean up the content (at least we try) so that we have a well-formed XML document, which we need to use as an in-memory model.

  • http://www.webolize.com/ Elias

    Regarding post #24 – pretty good analysis!

    I just checked windows media player ver 10
    and it seems to allow mp3 up to 320 Kbps.
    And the max for Windows Media Audio is 192Kbps.
    There is a lossless option that goes up to 940Kbps.

    I happen to believe that you are right about previous versions. So hopefully other things will continue in this direction (of competitive software).