Skip to content
Snippets Groups Projects
  • David Bauer's avatar
    07eeca64
    netifd: system-linux: fix race condition in netlink socket error handing · 07eeca64
    David Bauer authored
    The error handling needed for the buffer growth logic relies on
    uloop_fd's error flag, which is set based on epoll events. Doing so
    without handling recvmsg's error codes is racy, as an error state may be
    set between receiving epoll events and the next recvmsg, but calling
    recvmsg clears the error state.
    
    To fix this, add handling for errors returned by nl_recvmsgs_default()
    and nl_recv(); checking for u->error and retrieving the error status
    using getsockopt() becomes redundant.
    
    We have observed this issue on Gluon (recent OpenWrt 23.05); on some
    devices with DSA switches, the bridge interface's carrier-on event would
    consistenly get lost during boot due to insufficient buffer space
    (see [1]).
    
    We have bisected the issue to netifd commit 516ab774cc16 ("system-linux:
    fix race condition on bringing up wireless devices"), but that commit only
    uncovered the preexisting bug by switching from getting the carrier state
    from sysfs to using the netlink messages in cb_rtnl_event().
    
    I suspect that other recent issues about netifd missing a carrier state
    change like [2] may have the same underlying cause.
    
    [1] https://github.com/freifunk-gluon/gluon/issues/3130
    [2] https://github.com/openwrt/openwrt/issues/13863
    
    
    
    Signed-off-by: default avatarMatthias Schiffer <mschiffer@universe-factory.net>
    Signed-off-by: default avatarDavid Bauer <mail@david-bauer.net>
    07eeca64
    History
    netifd: system-linux: fix race condition in netlink socket error handing
    David Bauer authored
    The error handling needed for the buffer growth logic relies on
    uloop_fd's error flag, which is set based on epoll events. Doing so
    without handling recvmsg's error codes is racy, as an error state may be
    set between receiving epoll events and the next recvmsg, but calling
    recvmsg clears the error state.
    
    To fix this, add handling for errors returned by nl_recvmsgs_default()
    and nl_recv(); checking for u->error and retrieving the error status
    using getsockopt() becomes redundant.
    
    We have observed this issue on Gluon (recent OpenWrt 23.05); on some
    devices with DSA switches, the bridge interface's carrier-on event would
    consistenly get lost during boot due to insufficient buffer space
    (see [1]).
    
    We have bisected the issue to netifd commit 516ab774cc16 ("system-linux:
    fix race condition on bringing up wireless devices"), but that commit only
    uncovered the preexisting bug by switching from getting the carrier state
    from sysfs to using the netlink messages in cb_rtnl_event().
    
    I suspect that other recent issues about netifd missing a carrier state
    change like [2] may have the same underlying cause.
    
    [1] https://github.com/freifunk-gluon/gluon/issues/3130
    [2] https://github.com/openwrt/openwrt/issues/13863
    
    
    
    Signed-off-by: default avatarMatthias Schiffer <mschiffer@universe-factory.net>
    Signed-off-by: default avatarDavid Bauer <mail@david-bauer.net>