Hi Joakim,
On 8/14/19 9:17 AM, Joakim Lotsengård wrote:
Good morning!
We have seen a DHCP RELEASE packet being sent on the wrong
interface (eth0) when another interface (mlan0) goes down. I have sadly
not had time yet to completely debug this. I apologize in advance for
not having time to test my theory properly. I would like to ask if
someone else has observed this problem to aid my debugging.
We have an embedded Linux device that has Ethernet (wired), WiFi and
cellular uplink connections. (It also has wifi downlink/AP interfaces
that are not controlled by connman.) This error can be reproduced by
having just ethernet (eth0 in our case) and WiFi (mlan0 in our case)
connected. It also reproduces with the cellular interface in place of WiFi.
Set up a running tcpdump on eth0 for DHCP packets. Then take down the
mlan0 (wifi) interface. You will see the DHCP release packet for
the IP on the mlan0 (wifi) interface go out on eth0. To make matters
worse, it is sent with the srcIP of eth0 (correct) but the dstIP is
the DHCP-server that was on mlan0 (wifi). The srcIP is probably correct
due to the SNAT-rule that connman added in iptables -t nat. If that rule
is removed, the incorrect srcIP of the old mlan0's IP is used.
My conclusion here is that connman kind of "correctly" detects the
downed mlan0 interface and wants to cancel its DHCP lease. It creates
the DHCP release packet correctly, with the correct srcIP (of the downed
iface) and dstIP of the DHCP-server that leased us the IP. The bad part
is that the interface is now down. The Linux kernel will do its best to
route out the packet. It takes the default path (eth0) and sends it out.
In our case we do have ip_forward enabled due to internal NATing in the
device.
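One way to confirm which interface the kernel would pick for the old DHCP server is `ip route get`. A minimal shell sketch, assuming iproute2 and awk are available on the device; `route_dev` is a helper name made up here for illustration, and 192.168.1.1 is the DHCP server seen in the dump further down:

```shell
# Print the output interface named in one line of `ip route get` output.
# route_dev is a hypothetical helper, not part of iproute2.
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Usage on the device, after mlan0 has been taken down:
#   ip route get 192.168.1.1 | route_dev
```

If that prints eth0 once mlan0 is down, the release is simply following the default route.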
One caveat: we do have other complex routing set up in
our device. We have, for example, downlink wifi interfaces as well and are
routing/NAT in a non-straightforward setup. I have not had the time yet
to disable all those things and re-test.
I will try to investigate this further but wanted to ask if someone else
has seen this issue. My guess is that a fix is to not send the DHCP release
if the interface we want to release the IP for is down. It kind of makes
sense. There must be very few cases where the DHCP release packet can be
routed to the correct DHCP-server via another interface.
Without looking at the code, I suspect that the DHCP release is triggered
unconditionally when the interface is taken down.
That is, when the interface goes down (user interaction), we send the DHCP
release. This works because it happens in the right order.
But when the interface goes down first, we still try to send out the DHCP
release without checking whether the interface is still up.
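The kind of guard meant here can be sketched in shell, outside of connman. This is only an illustration of the check, not connman code: it assumes the sysfs file /sys/class/net/<iface>/flags is available, and `iface_flags_up`, `iface_is_up` and `maybe_release` are hypothetical names.

```shell
IFF_UP=1   # IFF_UP bit as defined in <linux/if.h>

# Succeed if the given flags word (hex, as found in
# /sys/class/net/<iface>/flags) has the IFF_UP bit set.
iface_flags_up() {
    [ $(( $1 & $IFF_UP )) -ne 0 ]
}

# Succeed if the named interface exists and is administratively up.
iface_is_up() {
    flags_file="/sys/class/net/$1/flags"
    [ -r "$flags_file" ] || return 1
    iface_flags_up "$(cat "$flags_file")"
}

# Hypothetical guard: only attempt the release while the interface is up.
maybe_release() {
    if iface_is_up "$1"; then
        echo "send release on $1"
    else
        echo "skip release: $1 is down"
    fi
}
```

After `ip link set dev mlan0 down`, `maybe_release mlan0` would take the skip branch, so no release would be handed to the kernel to route elsewhere.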
The connman version used is the latest, 1.37. However, we tested some of our
older firmware and (at least) 1.36 has the problem as well.
Likely.
Some debugging output:
(I've hidden the true IP of eth0 due to privacy.)
Run and wait for output in one terminal: (Dump ethernet)
$ tcpdump -envvvs 0 -i eth0 "udp and (port 68 or port 67)"
In another terminal run: (Take down a connected wifi interface)
$ ip link set dev mlan0 down
The first terminal gives us this:
10:32:14.802697 00:23:c1:1a:54:11 > 58:97:bd:24:bb:48, ethertype IPv4
(0x0800), length 590: (tos 0x0, ttl 64, id 34388, offset 0, flags [DF],
proto UDP (17), length 576)
172.X.Y.12.68 > 192.168.1.1.67: [bad udp cksum 0xe011 -> 0x3d80!]
BOOTP/DHCP, Request from 24:c3:f9:00:08:89, length 548, xid 0x6eecdf95,
Flags [none] (0x0000)
Client-IP 192.168.1.100
Client-Ethernet-Address 24:c3:f9:00:08:89
Vendor-rfc1048 Extensions
Magic Cookie 0x63825363
DHCP-Message Option 53, length 1: Release
Server-ID Option 54, length 4: 192.168.1.1
END Option 255, length 0
PAD Option 0, length 0, occurs 298
IP 172.X.Y.12 is the IP of eth0. 192.168.1.0/24 is (was) the network of
mlan0.
If I run iptables -t nat -F (clear the nat tables) the dump is:
root@lev-26F7A6CY:/usr/sbin# tcpdump -envvvs 0 -i eth0 "udp and (port 68
or port 67)"
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size
262144 bytes
11:45:13.316968 00:23:c1:1a:54:11 > 58:97:bd:24:bb:48, ethertype IPv4
(0x0800), length 590: (tos 0x0, ttl 64, id 31652, offset 0, flags [DF],
proto UDP (17), length 576)
192.168.1.100.68 > 192.168.1.1.67: [bad udp cksum 0x85f3 ->
0xf9f3!] BOOTP/DHCP, Request from 24:c3:f9:00:08:89, length 548, xid
0x36bfb56d, Flags [none] (0x0000)
Client-IP 192.168.1.100
Client-Ethernet-Address 24:c3:f9:00:08:89
Vendor-rfc1048 Extensions
Magic Cookie 0x63825363
DHCP-Message Option 53, length 1: Release
Server-ID Option 54, length 4: 192.168.1.1
END Option 255, length 0
PAD Option 0, length 0, occurs 298
IP addresses of the device when both eth0 and mlan0 were up:
wwan0 and wwan1 are cellular, and not used in my test. uap0 is
downlink/AP wifi interface, not handled by connman.
root@lev-26F7A6CY:~# ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,DYNAMIC,UP,LOWER_UP> mtu 1500 qdisc mq
state UP group default qlen 1000
link/ether 00:23:c1:1a:54:11 brd ff:ff:ff:ff:ff:ff
inet 172.X.Y.12/24 brd 172.X.Y.255 scope global eth0
valid_lft forever preferred_lft forever
18: mlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
group default qlen 1000
link/ether 24:c3:f9:00:08:89 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.100/24 brd 192.168.1.255 scope global mlan0
valid_lft forever preferred_lft forever
19: uap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
group default qlen 1000
link/ether 24:c3:f9:00:08:8a brd ff:ff:ff:ff:ff:ff
inet 192.168.5.1/24 scope global uap0
valid_lft forever preferred_lft forever
20: uap1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group
default qlen 1000
link/ether 00:50:43:02:00:01 brd ff:ff:ff:ff:ff:ff
21: wwan0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group
default qlen 1000
link/ether fa:96:11:12:13:14 brd ff:ff:ff:ff:ff:ff
22: wwan1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group
default qlen 1000
link/ether fa:96:11:12:13:16 brd ff:ff:ff:ff:ff:ff
I can easily get the full debug output of connman when I take the mlan0
interface down, if needed. However, it is a loooong output to post in a
mail. I might also need to scrub private data (IPs) from the log.
Also, I have no idea why tcpdump thinks the UDP packet has an incorrect
checksum. The error was discovered due to an ISP reacting to getting
incorrect DHCP-releases from customers using our devices.
Maybe due to checksum offloading?
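If it is offloading, `ethtool -k` should report TX checksumming as enabled on eth0, in which case tcpdump sees the packet before the NIC fills in the checksum. A small sketch, assuming ethtool and awk are installed on the device; `tx_csum_state` is a made-up helper name:

```shell
# Print the state ("on" or "off") of the tx-checksumming feature from
# `ethtool -k <iface>` output read on stdin.
tx_csum_state() {
    awk '$1 == "tx-checksumming:" { print $2; exit }'
}

# Usage on the device:
#   ethtool -k eth0 | tx_csum_state
```

Temporarily turning it off with `ethtool -K eth0 tx off` and repeating the capture should make the bad-checksum warning disappear if offloading is the cause.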
Thanks,
Daniel