
How Python and editors such as Vim handle newlines: a bit of my own research

    jdhao · 2018-12-09 13:30:08 +08:00 · 3420 views

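    I previously posted a thread asking about newlines in Vim. I have since done a bit of research myself; the original article is here. A copy is also pasted below.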

    Every programmer knows about newlines, but perhaps not in much detail. In this post, I write down what I have learned about newline handling in various cases.

    Newline characters on different platforms

    For historical reasons, different platforms use different characters to mark the end of a line. Windows uses <CR><LF> (bytes 0x0D 0x0A), Linux uses <LF> (byte 0x0A), and older Mac systems[^1] used <CR> (byte 0x0D).

    <CR> and <LF> date from the time when typewriters were used to print text on paper. <CR> stands for carriage return, which moves the carriage back to its left-most position. <LF> stands for line feed, which moves the paper up a little so that you can type on a new line. Combined, these two actions start a new line ready for typing.

    Newline handling in Python

    Python 2 has a universal newlines mode: when a file is read with the mode specifier rU, every line ending is translated to \n, whatever line ending the file actually uses.

    In Python 3, things have changed. The old U mode specifier is deprecated (and was removed in Python 3.11) in favor of the newline parameter of the open() function. According to the documentation:

    newline controls how universal newlines mode works (it only applies to text mode). It can be None, '', '\n', '\r', and '\r\n'. It works as follows:

    • When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated.
    • When writing output to the stream, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '' or '\n', no translation takes place.
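
    The reading half of the quoted rules can be checked with a small experiment. This is only a sketch: the sample bytes and the helper name read_with are made up for the demonstration, and a temporary file stands in for a real text file.

```python
import os
import tempfile

# Write raw bytes so that the file contains all three line-ending styles.
raw = b"one\r\ntwo\rthree\nfour"
fd, path = tempfile.mkstemp()  # temporary file used only for this demo
with os.fdopen(fd, "wb") as f:
    f.write(raw)


def read_with(newline):
    """Read the demo file in text mode with the given newline setting."""
    with open(path, "r", newline=newline, encoding="ascii") as f:
        return f.read()


translated = read_with(None)   # universal newlines, endings translated to "\n"
untranslated = read_with("")   # universal newlines, endings returned as-is

print(repr(translated))
print(repr(untranslated))

os.remove(path)
```

    With newline=None all three endings come back as \n; with newline='' the original \r\n, \r and \n are preserved in the returned string.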

    By default, newline is None when reading text files, which means that \r\n and \r line endings are quietly replaced by \n. If you are not aware of this behavior, you may get into trouble. For example, suppose you are on Windows, read a file with \r\n line endings, and try to split the text into lines with the following snippet:

    import os

    with open("some_file.txt", "r") as f:
        text = f.read()  # every "\r\n" has already been translated to "\n"
    lines = text.split(os.linesep)  # on Windows this searches for "\r\n"
    

    You will not be able to split the text into lines: on Windows, os.linesep is \r\n, but Python has already translated every \r\n in the file to \n, so there is nothing left to split on.

    When writing files in text mode, be aware that \n is likewise translated to the platform-dependent line ending by default.
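
    One way around this pitfall is to avoid os.linesep altogether and use str.splitlines(). The sketch below hard-codes a sample string and the Windows separator so that it behaves the same on every platform; both are assumptions made for illustration.

```python
# Text as Python returns it after universal-newline translation.
text = "first line\nsecond line\nthird line\n"

# "\r\n" is what os.linesep is on Windows; the translated text no longer
# contains it, so the split returns the whole text as a single item.
broken = text.split("\r\n")

# str.splitlines() recognises "\n", "\r\n" and "\r" (among other line
# boundaries) and drops the line endings.
lines = text.splitlines()

print(len(broken))
print(lines)
```

    Iterating over the file object directly (for line in f) is another option when the line endings should be kept on each line.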
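
    The translation on writing can be observed by reading the raw bytes back. This is a minimal sketch; the helper name written_bytes is hypothetical and a temporary file is used in place of a real output file.

```python
import os
import tempfile


def written_bytes(newline):
    """Write two lines in text mode and return the raw bytes that reach the disk."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w", newline=newline, encoding="ascii") as f:
        f.write("a\nb\n")
    with open(path, "rb") as f:
        data = f.read()
    os.remove(path)
    return data


default = written_bytes(None)        # "\n" becomes os.linesep
untouched = written_bytes("")        # no translation takes place
forced_dos = written_bytes("\r\n")   # "\n" becomes "\r\n" on every platform

print(default, untouched, forced_dos)
```

    Passing newline='' or newline='\n' is therefore a simple way to get Unix line endings even when the script runs on Windows.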

    Newline handling in different editors

    Vim

    When reading a file, Vim automatically detects its file format[^2] and replaces the platform-dependent newline characters with an internal mark for the end of each line. When writing the buffer back to the file, Vim writes the actual newline characters that correspond to the detected file format.

    For example, if you open a file with Windows-style line endings (i.e., <CR><LF>), Vim replaces every <CR><LF> with its own newline mark. If you search for these two characters by byte value (\%x0d for <CR> and \%x0a for <LF>), you will find nothing. Nor can you find the <CR> characters with \r in a proper Windows file detected by Vim. When searching in Vim, \n matches the end of a line, whatever the actual newline characters of the file are, so you can search for line ends with \n.
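
    To make the idea of format detection concrete, here is a rough Python sketch of such a heuristic. It is not Vim's actual algorithm (see the help topics in the footnote for that); the function name guess_fileformat and its rules are simplifying assumptions.

```python
def guess_fileformat(data: bytes) -> str:
    """Guess 'dos', 'unix' or 'mac' from raw bytes (simplified heuristic)."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n")
    cr = data.count(b"\r")
    if lf and crlf == lf:
        return "dos"   # every LF is preceded by a CR
    if lf:
        return "unix"  # at least one bare LF
    if cr:
        return "mac"   # only bare CRs
    return "unix"      # no line endings at all; arbitrary default


print(guess_fileformat(b"a\r\nb\r\n"))
print(guess_fileformat(b"a\nb\n"))
print(guess_fileformat(b"a\rb\r"))
```

    Note that a file with mixed endings is classified as unix here, which is why the stray <CR> characters of such a file become visible.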

    How do I show the <CR> characters in Vim then?

    Open the Windows file in Vim and run :e ++ff=unix[^4] to force Vim to treat it as a Unix file. Vim then treats only the \n characters as newlines and removes them from the buffer, while the \r characters are treated as ordinary characters and shown as ^M, so you can now see them.

    You can also press <Ctrl-V> and then <Enter> to type a carriage return character, and then search for it with \r.
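
    The effect of reading a Windows file as a Unix file can be mimicked in Python on raw bytes. This is only an analogy under the assumption that a line is whatever sits between two separators; it does not reproduce how Vim stores its buffer.

```python
raw = b"alpha\r\nbeta\r\n"

# Treat the data as a Unix file: only LF ends a line, so CR stays in the text.
as_unix = raw.split(b"\n")

# Treat the data as a DOS file: the CRLF pair ends a line.
as_dos = raw.split(b"\r\n")

print(as_unix)
print(as_dos)
```

    The trailing b"\r" on each item of as_unix corresponds to the ^M that Vim displays at the end of every line.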

    A caveat in searching and replacing newlines

    In Vim, \n represents a newline only in a search pattern. To insert a newline in the replacement string, use \r instead[^3].

    Sublime Text

    According to the discussion here, Sublime Text also converts platform-dependent newlines to \n in memory. When writing a file, it writes newlines according to the detected file type (Windows, Unix or Mac).

    Notepad++

    Notepad++ is also a popular code editor. It can detect your line endings, but it does not replace them with \n. To show the newline characters in a file, go to View --> Show Symbol and turn on the option Show End of Line.

    Conversion between different file formats

    In Vim, use :set ff=<Format> to convert the current file to the desired format, where <Format> can be unix, dos or mac. The new line endings are written the next time you save the buffer.

    In Sublime Text, just choose the desired format from the bottom right status bar.

    In Notepad++, go to Edit --> EOL Conversion and choose the desired file format.

    There are also tools such as dos2unix and unix2dos which convert between different file formats.
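
    When none of these tools is at hand, the same kind of conversion can be sketched in a few lines of Python working on raw bytes. The function names to_unix and to_dos are made up for this example, and the code is not how dos2unix or unix2dos are implemented.

```python
def to_unix(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def to_dos(data: bytes) -> bytes:
    """Convert any line ending to CRLF, normalising to LF first."""
    return to_unix(data).replace(b"\n", b"\r\n")


print(to_unix(b"a\r\nb\rc\n"))
print(to_dos(b"a\nb\r\nc\r"))
```

    Read and write the file in binary mode ("rb" and "wb") when using such helpers, so that Python's own newline translation does not interfere.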


    [^1]: Newer Mac systems also use the Unix-style newline character; see the discussion here.

    [^2]: In Vim, use :h fileformats, :h file-read and :h file-formats for more information about how Vim detects the file format and reads files.

    [^3]: In a replacement, \n means the null character \0, which is shown as ^@ in Vim. See here for more discussion.

    [^4]: Use :h ++ff to find more information about what this command means.

    19 replies · 2018-12-10 12:09:44 +08:00

    #1 · storypanda · 2018-12-09 14:15:45 +08:00
    Perhaps i think you need post a chinese vision and the english version as a link,and need a toc.
    Bec this is more chinese programmer(I still know this bbs have english language to chance).
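
    #2 · storypanda · 2018-12-09 14:17:52 +08:00
    *to change.

    Perhaps you should post a Chinese version and just link to the English one; most people here are Chinese programmers. Of course, I also know the forum offers English as an option.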

    #3 · hhhsuan · 2018-12-09 14:52:57 +08:00 via Android
    Don't use Windows line endings, even on Windows.

    #4 · zn · 2018-12-09 15:05:20 +08:00 · ❤️ 4
    Isn't this just common knowledge?...

    #5 · jdhao (OP) · 2018-12-09 15:11:18 +08:00 via Android
    @storypanda I only wrote it in English. Translating it into Chinese is too much trouble and I don't have the energy.
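
    #6 · jdhao (OP) · 2018-12-09 15:13:19 +08:00 via Android
    @zn Not for me.
    @hhhsuan Very often Windows line endings are simply the default, and modern editors can handle every line-ending format anyway, so it doesn't really matter.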

    #7 · misaka19000 · 2018-12-09 15:19:16 +08:00 via Android
    V2EX discourages reposting full articles; for this kind of thing, a link is enough.

    #8 · masker · 2018-12-09 16:29:42 +08:00 via Android
    It would be better to mark it as a repost.

    #9 · azh7138m · 2018-12-09 16:49:53 +08:00 via Android
    @masker The OP is the original author, so it can't really count as a repost.

    #10 · ETiV · 2018-12-09 17:24:35 +08:00 via iPhone · ❤️ 1
    I didn't want to keep reading after the first two words.

    #11 · hsfzxjy · 2018-12-09 17:41:54 +08:00 via Android
    Lots of grammatical errors.

    #12 · jdhao (OP) · 2018-12-09 18:19:52 +08:00 via Android
    @hsfzxjy Such as?

    #13 · suueyoung · 2018-12-09 18:39:07 +08:00 · ❤️ 1
    @ETiV Yeah. Seeing grammatical errors like that, I lose the mood to read on.
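
    #14 · jdhao (OP) · 2018-12-09 18:44:08 +08:00
    @suueyoung My hand slipped and I mistyped. It's a lot of text after all, and I don't have an editor. It has been fixed in the original article.
    @ETiV Fixed in the original article. Thanks for pointing it out.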
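
    #15 · dacapoday · 2018-12-09 19:28:03 +08:00
    Tiring to read, and after finishing it I found it was all common knowledge. Some IDEs, such as keil, behave the same way. Much ado about nothing.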

    #16 · jdhao (OP) · 2018-12-09 19:38:29 +08:00 via Android
    @dacapoday If you already know it, you don't need to read it. Why read it if you already knew? 😂
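
    #17 · 1423 · 2018-12-09 19:49:06 +08:00 via Android · ❤️ 2
    I dislike this influencer-style kind of article.
    Nicely packaged, shallow content, lots of filler.
    It feels like watching a video posted by a marketing account.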

    #18 · masker · 2018-12-09 21:00:19 +08:00 via Android
    @jdhao They said they only found that out after reading it.

    #19 · oyjw443523 · 2018-12-10 12:09:44 +08:00
    Nicely written. I learned something new.