Output Errors on VLAN interfaces
Andy Lemin
2016-08-05 13:19:25 UTC
Hi guys,

Has anyone else seen issues with "output errors" occurring on only VLAN
interfaces since upgrading to 5.9? (and after using openup to get latest
kernel).

It does not happen on all VLAN interfaces, only ones under load.

The underlying trunk does not report any Rx or Tx errors at all.

And the VLAN interfaces do not report any receive errors, only low rate
transmit errors.


Also as a thought exercise, could anyone kindly explain/discuss how an
output error might even occur or be valid?

You would think that if the packet has been through the whole OpenBSD stack
that it should not have an error on output (input errors, yes, definitely
possible).

But if the packet was/is in error, why is it transmitting it at all, or not
being dropped before the output stage?

Thanks, Andy.
Chris Cappuccio
2016-08-09 01:48:56 UTC
Post by Andy Lemin
The underlying trunk does not report any Rx or Tx errors at all.
And the VLAN interfaces do not report any receive errors, only low rate
transmit errors.
Also as a thought exercise, could anyone kindly explain/discuss how an
output error might even occur or be valid?
Look at /usr/src/sys/net/if_vlan.c, you'll find exactly two places where
if_oerrors increments. Logically, both are in the vlan_start() routine.
The first happens after vlan_inject fails. If vlan_inject returns a null
mbuf, that appears to be a failure within m_prepend(), probably from
failure to allocate memory for the new mbuf. Where's your dmesg? Are you
using a card that does hw tagging? (If so, this isn't the codepath you're
looking for.)

If the failure is the new if_enqueue, it seems like ifq_enqueue would be
calling priq_enq which would be returning a failure if the queue is full.
Are you using hfsc?
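
For readers following along, the two failure paths described above can be sketched roughly as below. This is a minimal user-space illustration, not the code in if_vlan.c: every name prefixed with sketch_ is hypothetical, the tagged_ok flag merely stands in for a successful header prepend, and the queue is a plain counter with tail-drop behaviour. The value 55 is ENOBUFS as defined in OpenBSD's sys/errno.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the real OpenBSD structures. */
#define SKETCH_ENOBUFS 55	/* ENOBUFS is 55 in OpenBSD's <sys/errno.h> */

struct sketch_queue {
	unsigned int len;	/* packets currently queued */
	unsigned int maxlen;	/* queue limit */
};

struct sketch_ifstats {
	unsigned long opackets;	/* packets handed to the parent */
	unsigned long oerrors;	/* packets dropped on output */
};

/* Tail-drop enqueue: refuse the packet when the queue is already full. */
int
sketch_enqueue(struct sketch_queue *q)
{
	assert(q != NULL);
	if (q->len >= q->maxlen)
		return SKETCH_ENOBUFS;
	q->len++;
	return 0;
}

/*
 * Output path for one packet on the VLAN interface.
 * tagged_ok models whether prepending the VLAN header succeeded
 * (0 means the allocation failed and there is no packet to send).
 */
void
sketch_vlan_output(struct sketch_ifstats *st, struct sketch_queue *parentq,
    int tagged_ok)
{
	if (!tagged_ok) {			/* failure path 1 */
		st->oerrors++;
		return;
	}
	if (sketch_enqueue(parentq) != 0) {	/* failure path 2 */
		st->oerrors++;
		return;
	}
	st->opackets++;
}
```

In this sketch the error counter belongs to the VLAN interface while the full queue belongs to the parent, which would be consistent with output errors showing up on the VLAN and not on the trunk.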

Chris
Andy Lemin
2016-09-22 17:19:32 UTC
Hi Chris,

Sorry for the slow reply. Day job takes up most of my time.

Anyway, I finally added some logging into /usr/src/sys/net/if_vlan.c etc;

                if (m == NULL) {
                        ifp->if_oerrors++;
                        printf("Output Error due to NULL mbuff\n");
                        continue;
                }
        }

        if (if_enqueue(ifp0, m)) {
                ifp->if_oerrors++;
                printf("Output Error from if_enqueue\n");
                continue;
        }
        ifp->if_opackets++;


Recompiled the kernel and rebooted onto it, and pushed traffic through it
(~50Mbps).

And sure enough, every single one of the VLAN output drops is due to
"if_enqueue(ifp0, m)" returning non-zero. I edited if.c and again confirmed
that IFQ_ENQUEUE returns the error.

Traced it further back to ifq.c:ifq_enqueue_try(), and rv (from rv =
ifq->ifq_ops->ifqop_enq(ifq, m);) is 55 for every one of the VLAN output
drops.


I needed some help from a colleague to figure out what
ifq->ifq_ops->ifqop_enq(ifq, m) calls.

We believe it should be calling ifq.c:priq_enq(). I still don't understand
that glue part yet :( But after adding some logging on "if (ifq_len(ifq) >=
ifq->ifq_maxlen)" it doesn't seem to be that. So I have either made a mistake
or gone as far as my knowledge can go. Any _pointers_, guys? ;)
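
On the "glue" question, a call such as ifq->ifq_ops->ifqop_enq(ifq, m) is an indirect call through a table of function pointers, so which function runs depends on which table the queue currently points at. The sketch below illustrates only that mechanism. All sketch_ names are hypothetical, the packet argument is omitted, and the second discipline with its own per-class limit is an assumption made for illustration, not a description of how HFSC is implemented or attached. The return value 55 is ENOBUFS on OpenBSD.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ENOBUFS		55	/* ENOBUFS in OpenBSD's <sys/errno.h> */
#define SKETCH_CLASS_LIMIT	1	/* hypothetical per-class limit */

struct sketch_ifq;

/* Table of operations; one table per queueing discipline. */
struct sketch_ifq_ops {
	int (*enq)(struct sketch_ifq *);
};

struct sketch_ifq {
	const struct sketch_ifq_ops *ops;	/* currently attached discipline */
	unsigned int len;
	unsigned int maxlen;
};

/* Discipline A: drops only when the interface-wide limit is reached. */
int
sketch_priq_enq(struct sketch_ifq *q)
{
	if (q->len >= q->maxlen)
		return SKETCH_ENOBUFS;
	q->len++;
	return 0;
}

/* Discipline B: enforces its own limit and never looks at maxlen. */
int
sketch_shaper_enq(struct sketch_ifq *q)
{
	if (q->len >= SKETCH_CLASS_LIMIT)
		return SKETCH_ENOBUFS;
	q->len++;
	return 0;
}

const struct sketch_ifq_ops sketch_priq_ops = { sketch_priq_enq };
const struct sketch_ifq_ops sketch_shaper_ops = { sketch_shaper_enq };

/* The "glue": the caller never names the discipline, it follows the pointer. */
int
sketch_ifq_enqueue(struct sketch_ifq *q)
{
	assert(q != NULL && q->ops != NULL);
	return q->ops->enq(q);
}
```

If the parent interface has a different discipline attached than the one being instrumented, logging placed in the other discipline's limit check would never fire even though the enqueue still returns 55. Whether that is what happens here is a guess that would need checking against the running kernel.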


We do use HFSC (and have done since 5.0 without issues), but only on the
physical interface, not on the VLANs.

The reason for this is so that we can _share_ the whole of the 10Gig
interface root bandwidth across all of the VLANs on the same physical .1q
trunk. This has worked great for years without VLAN output errors. I think
this started after 5.8 or 5.9.

I increased the qlimits from the default but that made no difference.


queue trunk_root on $if_trunk bandwidth 4294M
queue qlocal on $if_trunk parent trunk_root bandwidth 4.1G
queue local_kern on $if_trunk parent qlocal bandwidth 8M min 8M burst 8M for 1000ms
queue local_pri on $if_trunk parent qlocal bandwidth 150M min 150M burst 200M for 2500ms qlimit 500
queue local_data on $if_trunk parent qlocal bandwidth 4G min 1G qlimit 1000
queue qwan on $if_trunk parent trunk_root bandwidth 190M
queue wan_rt on $if_trunk parent qwan bandwidth 30M min 19M burst 38M for 5000ms
queue wan_int on $if_trunk parent qwan bandwidth 19M min 9M
queue wan_pri on $if_trunk parent qwan bandwidth 19M min 10M burst 25M for 2000ms
queue wan_vpn on $if_trunk parent qwan bandwidth 50M min 25M
queue wan_web on $if_trunk parent qwan bandwidth 29M min 10M burst 19M for 3000ms
queue wan_dflt on $if_trunk parent qwan bandwidth 19M min 10M burst 19M for 5000ms
queue wan_bulk on $if_trunk parent qwan bandwidth 20M max 100M default
.
.
match out on INSIDE all received-on INSIDE queue (local_data,local_pri) set prio (2,4)


So all traffic flowing from one VLAN to another (on the same trunk) is in
queues local_data and local_pri. However, looking at the queue statistics
with "systat queues 1" shows that these large internal queues never drop a
single packet. Yet if_oerrors for the VLANs is still incrementing quite a lot
for most of our VLANs.


Hi Henning, whilst I have the code open, I am also going to have another go
at trying to find the missing 64-bit counter/range check etc. for the HFSC
queue size tomorrow (if I don't get dragged onto anything else).


Thanks for your time and help guys,

Kind regards, Andy Lemin