Can't open file

essin

03 Apr, 2013 08:49 AM

Marked says it can't open the file because it's not UTF-8; Mou says it can't open it because IT IS UTF-8. Sublime Text, TextMate and Slick Edit open it with no trouble.

  1. Support Staff 1 Posted by Brett on 03 Apr, 2013 01:11 PM

    Your file is encoded as ISO-8859, which quite a few apps won't recognize. Open it in Sublime and choose "Save with Encoding" -> UTF-8 and that should solve the issue. Then you're going to run into a second issue: the file is full of non-UTF-8 characters, and the lack of line breaks between paragraphs means that the text will all run together into one paragraph. Needs a little work :).

  2. 2 Posted by essin on 04 Apr, 2013 08:22 AM

    I'm working on this. The text files were generated by Acrobat "export to
    text file". They don't offer any options to choose an encoding or to
    specify how to handle line and paragraph breaks. As a result it's both
    tantalizingly interesting and almost useless. I've got 150 files and if
    I have to clean up each one by hand, it would be just as easy to open
    each pdf, copy the text to the clipboard, paste it somewhere and then edit.

    I was hoping to use some sort of batch process to turn the text files
    into markdown files and then use Marked to generate HTML. So far I
    haven't even figured out how to reliably fix the aberrant characters. I
    can change the encoding all right, but it just transforms one weird
    character into a different one. I haven't found a function that will
    find all the Unicode apostrophes and change them to ASCII 39's and
    change all the Unicode hyphens to ASCII 45's.

    I'll keep working on it. If you have any suggestions I would appreciate
    hearing them.

    Thanks,
    Dan

  3. Support Staff 3 Posted by Brett on 09 Apr, 2013 03:18 AM

    You can script the substitution of UTF-8 characters in Python or Ruby. You just have to find the \uXXXX escape codes for the ones you need to swap and then substitute them globally. I don't have time to provide an example at the moment, but it is possible.
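
A minimal Python sketch of that kind of global substitution, for illustration only. It assumes the text has already been decoded into a Python string, and the list of characters in `PUNCTUATION_MAP` is a guess at what a PDF export tends to contain, not something taken from the actual file. The names `PUNCTUATION_MAP` and `asciify_punctuation` are hypothetical.

```python
# Map typographic Unicode punctuation to plain ASCII equivalents.
# Assumption: this set of characters is illustrative; extend it to match
# whatever actually appears in your exported files.
PUNCTUATION_MAP = {
    "\u2018": "'",   # left single quotation mark  -> ASCII 39
    "\u2019": "'",   # right single quotation mark -> ASCII 39
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2010": "-",   # hyphen                      -> ASCII 45
    "\u2011": "-",   # non-breaking hyphen
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
}

# str.maketrans accepts a dict whose keys are single characters.
TRANSLATION_TABLE = str.maketrans(PUNCTUATION_MAP)


def asciify_punctuation(text: str) -> str:
    """Replace every mapped character in one pass over the string."""
    return text.translate(TRANSLATION_TABLE)
```

Calling `asciify_punctuation("it\u2019s")` returns `"it's"`. Characters that are not in the map pass through unchanged, so anything still looking odd afterwards points to a code point that needs adding.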

    You can also script the encoding conversion, either using a command line utility or by reading in the content, forcing the encoding and writing out a new file.
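
The "read, force the encoding, write a new file" route might look roughly like the sketch below. The default of `iso-8859-1` is an assumption based on the diagnosis earlier in this thread; if the files are really Windows-1252 or Mac Roman, pass that name instead. The function name `convert_to_utf8` is hypothetical. On the command line, `iconv -f ISO-8859-1 -t UTF-8 in.txt > out.txt` does a comparable job.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="iso-8859-1"):
    """Re-encode one file as UTF-8 and return the decoded text.

    Assumption: the whole file really is in `source_encoding`.
    ISO-8859-1 maps every byte to a character, so decoding never fails,
    but a wrong guess silently produces wrong characters.
    """
    raw = Path(src).read_bytes()          # read without interpreting bytes
    text = raw.decode(source_encoding)    # force the source encoding
    Path(dst).write_bytes(text.encode("utf-8"))
    return text
```

For a batch of files, this could be called in a loop over `Path("some_folder").glob("*.txt")`, writing each result to a separate output folder so the originals stay untouched.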

    Your formatting issues can be handled with some regular expressions, for the most part. A simple s/\b\n\b/\n\n/g type of substitution (replace all single line breaks with doubles) would help a lot...
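
One way to express "replace all single line breaks with doubles" in Python is sketched below. It uses lookarounds rather than the `\b` anchors in the pattern above, so it also catches lines that end in punctuation; that is a swapped-in variant, not the exact expression from this post. It assumes every single line break in the export marks a paragraph boundary, which may not hold for every file. The function name is hypothetical.

```python
import re

# A newline that is neither preceded nor followed by another newline.
SINGLE_NEWLINE = re.compile(r"(?<!\n)\n(?!\n)")


def double_single_line_breaks(text: str) -> str:
    """Turn lone line breaks into blank-line paragraph breaks."""
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Existing blank lines (two or more newlines) are left as they are.
    return SINGLE_NEWLINE.sub("\n\n", text)
```

For example, `double_single_line_breaks("one\ntwo\n\nthree")` returns `"one\n\ntwo\n\nthree"`. A lone newline at the very end of the text is also doubled, which Markdown processors generally ignore.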

    If I get a chance I'll take another look at your file and see if I can offer any more concrete solutions.

  4. 4 Posted by essin on 09 Apr, 2013 02:16 PM

    Thanks, I'll give your suggestions a try.

  5. Brett closed this discussion on 25 Apr, 2013 08:48 PM.
