Hi Chris,
Sorry for the slow reply. Day job takes up most of my time.
Anyway, I finally added some logging to /usr/src/sys/net/if_vlan.c etc.:
            if (m == NULL) {
                ifp->if_oerrors++;
                printf("Output Error due to NULL mbuff\n");
                continue;
            }
        }
        if (if_enqueue(ifp0, m)) {
            ifp->if_oerrors++;
            printf("Output Error from if_enqueue\n");
            continue;
        }
        ifp->if_opackets++;
Recompiled the kernel and rebooted onto it, and pushed traffic through it
(~50Mbps).
And sure enough, every single one of the VLAN output drops is due to
"if_enqueue(ifp0, m)" returning non-zero. I then added logging to if.c and
confirmed that IFQ_ENQUEUE does return the error.
I traced it further back to ifq.c:ifq_enqueue_try(), and rv (from rv =
ifq->ifq_ops->ifqop_enq(ifq, m);) is 55 (ENOBUFS) for every one of the VLAN
output drops.
I needed some help from a colleague to figure out what
ifq->ifq_ops->ifqop_enq(ifq, m) calls.
We believe it should be calling ifq.c:priq_enq(). I still don't understand
that glue part yet :( But after adding some logging on "if (ifq_len(ifq) >=
ifq->ifq_maxlen)", that check doesn't seem to be the one that fails. So I
have either made a mistake or gone as far as my knowledge can take me.
Any _pointers_, guys? ;)
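To check my own understanding of the glue, here is a small userland model of the function-pointer dispatch. It is only a sketch: the struct layouts are cut down to the fields mentioned above, and every name ending in _model (plus MODEL_ENOBUFS, hard-coded to 55 so it builds anywhere) is made up for illustration and is not the kernel's code. The point it shows is that the function actually run depends on which ops table the queue points at, so if the interface passed to if_enqueue() has a different discipline attached (we run HFSC on the trunk), the same call could return 55 without ever reaching the priq length check.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_ENOBUFS 55	/* assumption: value of ENOBUFS on OpenBSD */

struct mbuf { int id; };	/* stand-in, not the real mbuf */
struct ifqueue;

/* Cut-down ops table: one function pointer per queue operation. */
struct ifq_ops {
	int (*ifqop_enq)(struct ifqueue *, struct mbuf *);
};

struct ifqueue {
	const struct ifq_ops *ifq_ops;	/* which discipline is attached */
	unsigned int ifq_len;
	unsigned int ifq_maxlen;
};

/* Model of a priority-queue enqueue: fails only when the queue is full. */
static int
priq_enq_model(struct ifqueue *ifq, struct mbuf *m)
{
	(void)m;
	if (ifq->ifq_len >= ifq->ifq_maxlen)
		return (MODEL_ENOBUFS);
	ifq->ifq_len++;
	return (0);
}

/*
 * Model of some other discipline with its own drop policy. Here it
 * always refuses, just to show the error need not come from priq.
 */
static int
other_enq_model(struct ifqueue *ifq, struct mbuf *m)
{
	(void)ifq;
	(void)m;
	return (MODEL_ENOBUFS);
}

static const struct ifq_ops priq_ops_model = { priq_enq_model };
static const struct ifq_ops other_ops_model = { other_enq_model };

/* The "glue": call whatever enqueue function the queue's ops table names. */
static int
enqueue_model(struct ifqueue *ifq, struct mbuf *m)
{
	assert(ifq != NULL && ifq->ifq_ops != NULL);
	return (ifq->ifq_ops->ifqop_enq(ifq, m));
}
```

With ifq_ops set to &priq_ops_model and ifq_maxlen of 2, the third enqueue_model() call returns 55 from the length check. With ifq_ops set to &other_ops_model, the very same call returns 55 from a function where logging in priq would never fire.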
We do use HFSC (and have done since 5.0 without issues), but only on the
physical interface, not on the VLANs.
The reason for this is so that we can _share_ the whole of the 10Gig
interface root bandwidth across all of the VLANs on the same physical .1q
trunk. This has worked great for years without VLAN output errors. I think
this started after 5.8 or 5.9.
I increased the qlimits from the default but that made no difference.
queue trunk_root on $if_trunk bandwidth 4294M
queue qlocal on $if_trunk parent trunk_root bandwidth 4.1G
queue local_kern on $if_trunk parent qlocal bandwidth 8M min 8M burst 8M for 1000ms
queue local_pri on $if_trunk parent qlocal bandwidth 150M min 150M burst 200M for 2500ms qlimit 500
queue local_data on $if_trunk parent qlocal bandwidth 4G min 1G qlimit 1000
queue qwan on $if_trunk parent trunk_root bandwidth 190M
queue wan_rt on $if_trunk parent qwan bandwidth 30M min 19M burst 38M for 5000ms
queue wan_int on $if_trunk parent qwan bandwidth 19M min 9M
queue wan_pri on $if_trunk parent qwan bandwidth 19M min 10M burst 25M for 2000ms
queue wan_vpn on $if_trunk parent qwan bandwidth 50M min 25M
queue wan_web on $if_trunk parent qwan bandwidth 29M min 10M burst 19M for 3000ms
queue wan_dflt on $if_trunk parent qwan bandwidth 19M min 10M burst 19M for 5000ms
queue wan_bulk on $if_trunk parent qwan bandwidth 20M max 100M default
.
.
match out on INSIDE all received-on INSIDE queue (local_data,local_pri) set prio (2,4)
So all traffic flowing from one VLAN to another (on the same trunk) goes
into queues local_data and local_pri. However, the queue statistics from
"systat queues 1" show that these large internal queues never drop a single
packet, yet if_oerrors is still incrementing quite a lot on most of our
VLANs.
Hi Henning, whilst I have the code open, I am also going to have another go
at trying to find the missing 64-bit counter/range check etc. for the HFSC
queue size tomorrow (if I don't get dragged onto anything else).
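For context on why a range check matters, here is a tiny sketch of what happens when a rate in bits per second is squeezed into a 32-bit field. This is purely an illustration of the arithmetic: I am assuming M and G mean 10^6 and 10^9, the helper names are hypothetical, and I am not claiming this is how the HFSC code actually stores anything.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: keep only what fits in a 32-bit rate field. */
static uint32_t
store_rate32(uint64_t bps)
{
	return ((uint32_t)bps);	/* silently drops the high 32 bits */
}

/* Hypothetical range check that would catch the problem up front. */
static int
rate_fits32(uint64_t bps)
{
	return (bps <= UINT32_MAX);
}

/* Worked numbers for the two rates discussed in this thread. */
static void
rate_model_selfcheck(void)
{
	const uint64_t root_4294m = 4294000000ULL;	/* 4294M */
	const uint64_t ten_gig = 10000000000ULL;	/* 10G */

	/* 4294M is just under 2^32, so it survives unchanged. */
	assert(rate_fits32(root_4294m));
	assert(store_rate32(root_4294m) == 4294000000U);

	/* 10G does not fit: it wraps to 10^10 mod 2^32 = 1410065408. */
	assert(!rate_fits32(ten_gig));
	assert(store_rate32(ten_gig) == 1410065408U);
}
```

Under those assumptions, 4294M is about the largest round value that fits, while 10G would quietly turn into roughly 1.41G unless something like rate_fits32() rejects it first.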
Thanks for your time and help guys,
Kind regards, Andy Lemin
Post by Chris Cappuccio
Post by Andy Lemin
The underlying trunk does not report any Rx or Tx errors at all.
And the VLAN interfaces do not report any receive errors, only low rate
transmit errors.
Also as a thought exercise, could anyone kindly explain/discuss how an
output error might even occur or be valid?
Look at /usr/src/sys/net/if_vlan.c, you'll find exactly two places where
if_oerrors increments. Logically, both are in the vlan_start() routine.
The first happens after vlan_inject fails. If vlan_inject returns a null
mbuf, that appears to be a failure within m_prepend(), probably from
failure to allocate memory for the new mbuf. Where's your dmesg? Are you
using a card that does hw tagging? (If so, this isn't the codepath you're
looking for.)
If the failure is the new if_enqueue, it seems like ifq_enqueue would be
calling priq_enq which would be returning a failure if the queue is full.
Are you using hfsc?
Chris