Can't open file
Marked says it can't open because it's not UTF-8, Mou says it can't open it because IT IS UTF-8. Sublime Text, TextMate and Slick Edit open it with no trouble.
Support Staff 1 Posted by Brett on 03 Apr, 2013 01:11 PM
Your file is ISO-8859, which won't be recognized by quite a few apps. Open it in Sublime and choose "Save with Encoding" -> UTF-8 and it should solve that issue. Then you're going to run into an issue where the file is full of non-UTF8 characters, and your lack of line breaks between paragraphs means that the text will all run together into one paragraph. Needs a little work :).
2 Posted by essin on 04 Apr, 2013 08:22 AM
I'm working on this. The text files were generated by Acrobat "export to
text file". They don't offer any options to choose an encoding or to
specify how to handle line and paragraph breaks. As a result it's both
tantalizingly interesting and almost useless. I've got 150 files and if
I have to clean up each one by hand, it would be just as easy to open
each pdf, copy the text to the clipboard, paste it somewhere and then edit.
I was hoping to use some sort of batch process to turn the text files
into Markdown files and then use Marked to generate HTML. So far I
haven't even figured out how to reliably fix the aberrant characters. I
can change the encoding all right, but it just transforms one weird
character into a different one. I haven't found a function that will
find all the unicode apostrophes and change them to ASCII 39's and
change all the unicode hyphens to ASCII 45's.
I'll keep working on it. If you have any suggestions I would appreciate
hearing them.
Thanks,
Dan
Support Staff 3 Posted by Brett on 09 Apr, 2013 03:18 AM
You can script the substitution of UTF-8 characters in Python or Ruby. You just have to find the \uXXXX codes for the ones you need to swap and then substitute them globally. I don't have time to provide an example at the moment, but it is possible.
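For illustration, here is a minimal Python sketch of that kind of global substitution. The table of code points is an assumption about which characters show up in the exported files (curly quotes, hyphens and dashes), and the name `asciify_punctuation` is made up for this example. Extend the table with whatever codes you actually find.

```python
# Map typographic punctuation to plain ASCII.
# This table is illustrative and deliberately incomplete.
REPLACEMENTS = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (typographic apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2010": "-",   # hyphen
    "\u2011": "-",   # non-breaking hyphen
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
}

# str.maketrans builds a lookup table that str.translate applies in one pass.
TABLE = str.maketrans(REPLACEMENTS)


def asciify_punctuation(text: str) -> str:
    """Return text with the characters in REPLACEMENTS swapped for ASCII."""
    return text.translate(TABLE)
```

This only works on text that has already been decoded correctly. If the file was read with the wrong encoding, the apostrophes and hyphens will not appear as these code points and the table will not match them.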
You can also script the encoding conversion, either using a command line utility or by reading in the content, forcing the encoding and writing out a new file.
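A rough sketch of the "read, force the encoding, write out a new file" approach in Python follows. The source encoding is an assumption: `iso-8859-1` is used as the default here, but Acrobat's export may really be something else (Windows-1252, for example), so treat it as a parameter to experiment with. The function names and the `.txt` pattern are hypothetical.

```python
from pathlib import Path


def reencode(raw: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Decode raw bytes from source_encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def convert_file(src: Path, dst: Path, source_encoding: str = "iso-8859-1") -> None:
    """Write a UTF-8 copy of src to dst, leaving src untouched."""
    # Working with bytes avoids any newline translation by the platform.
    dst.write_bytes(reencode(src.read_bytes(), source_encoding))


def convert_directory(src_dir: Path, dst_dir: Path,
                      source_encoding: str = "iso-8859-1") -> int:
    """Convert every .txt file in src_dir into dst_dir; return the file count."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in sorted(src_dir.glob("*.txt")):
        convert_file(src, dst_dir / src.name, source_encoding)
        count += 1
    return count
```

Writing the converted copies into a separate directory keeps the originals intact, which helps if the guessed encoding turns out to be wrong and the batch has to be rerun.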
Your formatting issues can be handled with some regular expressions, for the most part. A simple s/\b\n\b/\n\n/ type of substitution (replace all single line breaks with doubles) would help a lot...
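As a sketch of what such a substitution could look like in Python: the pattern below is a variation on the one above, not the same expression. It uses lookarounds to match any line break that has no other line break next to it, so existing blank lines are left alone. It assumes every single line break in the export marks a paragraph boundary, which may not hold for every file. The function name is made up for this example.

```python
import re

# A newline that is neither preceded nor followed by another newline.
SINGLE_NEWLINE = re.compile(r"(?<!\n)\n(?!\n)")


def double_single_line_breaks(text: str) -> str:
    """Turn each lone line break into a blank-line paragraph break."""
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return SINGLE_NEWLINE.sub("\n\n", text)
```

If the export instead hard-wraps lines inside paragraphs, this would split them apart, so it is worth checking a couple of files by eye before running it on all of them.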
If I get a chance I'll take another look at your file and see if I can offer any more concrete solutions.
4 Posted by essin on 09 Apr, 2013 02:16 PM
Thanks, I'll give your suggestions a try.
Brett closed this discussion on 25 Apr, 2013 08:48 PM.