Google Comments on Virtio Draft Spec

Post by Andrew Thornton
Hi All,
Included below is the Google feedback on “Virtual I/O Device (VIRTIO)
Version 1.0 Committee Speciﬁcation Draft 01 03 December 2013”. This is
feedback collected

Hi Andrew,

The second draft was released more recently, but most of your
comments still apply.

I've assigned it the following:
https://tools.oasis-open.org/issues/browse/VIRTIO-100

And I'll break it into subtasks by section, for various people to
address.

Thanks!
Rusty.

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-***@lists.oasis-open.org
Unsubscribe: virtio-comment-***@lists.oasis-open.org
List help: virtio-comment-***@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/

Rusty Russell

2014-06-05 07:05:31 UTC

OK, we've resolved/closed all these issues now. Below is a summary
in conveniently quotable email form. Some responses were via
the mailing list, but this lists all the actual spec changes which
resulted:

VIRTIO-101: Virtio General
Closed (draft 2 has a shutdown section)

VIRTIO-102: PCI discovery
Resolved in r376:
https://tools.oasis-open.org/version-control/browse/wsvn/virtio/?rev=376

VIRTIO-103: PCI Common configuration layout
Resolved in r364 and r365:
https://tools.oasis-open.org/version-control/browse/wsvn/virtio/?rev=364
https://tools.oasis-open.org/version-control/browse/wsvn/virtio/?rev=365

VIRTIO-104: PCI Operation
Resolved in r375:
https://tools.oasis-open.org/version-control/browse/wsvn/virtio/?rev=375

VIRTIO-105: Virtio NET
Closed
(See https://lists.oasis-open.org/archives/virtio-comment/201404/msg00013.html )

VIRTIO-106: Virtio SCSI
Resolved in r372:
https://tools.oasis-open.org/version-control/browse/wsvn/virtio/?rev=372

We don't seem to have closed the loop on getting responses from you, so
we plan on releasing a third and final draft in two weeks.

Cheers,
Rusty.

This publicly archived list offers a means to provide input to the

OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and

to minimize spam in the list archive, subscription is required

before posting.

Subscribe: virtio-comment-***@lists.oasis-open.org

Unsubscribe: virtio-comment-***@lists.oasis-open.org

List help: virtio-comment-***@lists.oasis-open.org

List archive: https://lists.oasis-open.org/archives/virtio-comment/

Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf

List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists

Committee: https://www.oasis-open.org/committees/virtio/

Join OASIS: https://www.oasis-open.org/join/

Andrew Thornton

2014-06-05 07:45:56 UTC

Hi Rusty,

I send a couple of replies that seems to have gone astray.

I have been told there have been some issues with posting to lists from an @
google.com account so I'd be grateful if you could confirm receipt of this
mail.

VIRTIO-101, VIRTIO-102, VIRTIO-103 and VIRTIO-104 all look good.

VIRTIO-105 - we retract the request.

VIRTIO-106 - This is the reply I sent a couple of weeks ago that seems to

'Configuration'

Device configuration layout

1. max_sectors and cmd_per_lun are described as 'hints'

1.1. Can these become hard limits rather than 'hints'? (IE
devices can reject commands above the cmd_per_lun limit
or the max_sectors limit). If so, can we select a specific
error to return in that case?

2. cmd_per_lun describes 'the actual value to be used is the
minimum of cmd_per_lun and the virtqueue size'.

2.1. Does this mean that devices can reject concurrent commands
above min(cmd_per_lun, virtqueue_size)?

2.2. Do you really mean 'virtqueue_size'? At minimum a command
requires at least 2 entries in the virtqueue. Should this
minimum be virtqueue_size / 2?

Device operation: requestq

1. When a transport returns VIRTIO_SCSI_S_BUSY, can we specify that a
guest should retry the request? This would simplify device
implementations
in the face of resource limitations and would allow guests to control
I/O queueing.

2. When a target is hotunplugged with I/O inflight, can we specify which
error
response will be returned for the now-terminated I/Os?

Device operation: controlq

The ordering of Task Management Function completion with
respect to requests they are acting on is unspecified. However
SCSI midlayers require TMF commands complete _after_ the command(s)
they are aborting/reseting.

EX:
requestq controlq
1: REQUEST A
2: TMF ABORT A
3: COMPLETE A
(S_ABORTED)
4: COMPLETION TMF ABORT A
[good ordering]

requestq controlq
1: REQUEST A
2: TMF ABORT A
3: COMPLETION TMF ABORT A
4: COMPLETION A
[bad ordering! surprise completion may corrupt guest memory]

This requires a device ensure ordering between the controlq and
requestq processing; for TMF RESET, this means a reset must
drain all the request queues (searching for undispatched
commands; QEMU does not do this currently and can corrupt guest
memory in the worst case).

1. If we could have a feature flag (VIRTIO_SCSI_F_TMF_ON_REQUESTQ)
that allowed TMF commands to be sent down the requestqueue,ordering
would be naturally enforced and devices would save a lot of complexity.

2. If that is not possible, a guest driver can cycle a
no-op command through request queue(s) before aborting/resetting
a command. To do this, we need to codify a safe no-op command.

We could use a command w/ lun[0] = 0x0 as a safe no-op command.
This is currently the case for QEMU, vhost-scsi, and GCE. We would
like to have this formalized.

Thanks,

Andy

Andrew Thornton

2014-06-06 00:59:12 UTC

Comments below;

'Configuration'
Device configuration layout

Post by Andrew Thornton
1. max_sectors and cmd_per_lun are described as 'hints'
1.1. Can these become hard limits rather than 'hints'? (IE
devices can reject commands above the cmd_per_lun limit
or the max_sectors limit). If so, can we select a specific
error to return in that case?

First of all, I think it's not necessary to select an error for these
cases. These issues are not specific to virtio-scsi and the command
will succeed at the virtio-scsi level; for cmd_per_lun the the status
could be BUSY (not VIRTIO_SCSI_S_BUSY!) or TASK SET FULL, for transfer
length the SCSI standard says you get INVALID FIELD IN CDB. These
status or sense codes are defined in the appropriate SCSI standards.
The configuration limits are imposed by the hypervisor, so transfer
lengths or queue depths higher than the values in the configuration
should cause an error. The reason these are hints is because the issue
is quite complex if you do SCSI passthrough, and in that case a transfer
length or queue depth lower than the limit could trigger an error.
For example, each target or LUN could actually have its own transfer
limit, that is lower than max_sectors. In this case the initiator
should look for the block limits VPD page anyway.
As to cmd_per_lun, you could obey cmd_per_lun and still get TASK SET
FULL responses from the target if the host or other guests are using it
at the same time. Perhaps the hypervisor could change that to BUSY
(again, not VIRTIO_SCSI_S_BUSY), but this is again a generic SCSI target
implementation issue, not specific to virtio-scsi.

If you treat max_sectors and cmd_per_lun as transport-specific limits and
the INQUIRY data/Block Limits VPDs as SCSI-layer limits, the differences
become clearer -- for transfer that fail because of limitations from the
target (INQUIRY/BLOCK LIMITS VPD levels), the target's SCSI-layer response
is the correct one. For transfers that fail because they hit the transport
limit (max_sectors/cmd_per_lun), a transport-level response would make more
sense.

Post by Andrew Thornton
2. When a target is hotunplugged with I/O inflight, can we specify which
error response will be returned for the now-terminated I/Os?

Either I/O is completed, or it is already documented to be ILLEGAL
REQUEST/LOGICAL UNIT NOT SUPPORTED.

When an I/O is already running against a target and it is hotremoved,
ILLEGAL REQUEST/LOGICAL UNIT NOT SUPPORTED is not a correct SCSI response.
That response would be appropriate if the I/O reached the target after it
was removed, but not for the case where the command was already running.

Also, consider long-running operations -- once a guest has already seen
them running on a target (ex: long running operation in progress), it would
not be correct to return ILLEGAL REQUEST/LOGICAL UNIT NOT SUPPORTED.

We would like to see a transport-layer response here, for example
VIRTIO_SCSI_S_TRANSPORT_FAILURE. Hot-unplugging a disk is analogous to
breaking the link to the target and Linux will reflect this as a
DID_TRANSPORT_DISRUPTED host byte.

Post by Andrew Thornton
2. If that is not possible, a guest driver can cycle a
no-op command through request queue(s) before aborting/resetting
a command. To do this, we need to codify a safe no-op command.
We could use a command w/ lun[0] = 0x0 as a safe no-op command.
This is currently the case for QEMU, vhost-scsi, and GCE. We would
like to have this formalized.

1. REPORT LUNS to the well-known address would work. No device emulation
currently supports it to the well-known address, however.

2. Can we specify that it is not an error (the device will not enter a
broken state) to send a command with either an invalid LUN or a command
that is too short? (ex: a 1 byte command?). Even an invalid command
would serve as a fine no-op, if it doesn't wedge the device.

Post by Paolo Bonzini
'Configuration'

Device configuration layout

First of all, I think it's not necessary to select an error for these
cases. These issues are not specific to virtio-scsi and the command will
succeed at the virtio-scsi level; for cmd_per_lun the the status could be
BUSY (not VIRTIO_SCSI_S_BUSY!) or TASK SET FULL, for transfer length the
SCSI standard says you get INVALID FIELD IN CDB. These status or sense
codes are defined in the appropriate SCSI standards.
The configuration limits are imposed by the hypervisor, so transfer
lengths or queue depths higher than the values in the configuration should
cause an error. The reason these are hints is because the issue is quite
complex if you do SCSI passthrough, and in that case a transfer length or
queue depth lower than the limit could trigger an error.
For example, each target or LUN could actually have its own transfer
limit, that is lower than max_sectors. In this case the initiator should
look for the block limits VPD page anyway.
As to cmd_per_lun, you could obey cmd_per_lun and still get TASK SET FULL
responses from the target if the host or other guests are using it at the
same time. Perhaps the hypervisor could change that to BUSY (again, not
VIRTIO_SCSI_S_BUSY), but this is again a generic SCSI target implementation
issue, not specific to virtio-scsi.
2. cmd_per_lun describes 'the actual value to be used is the

Post by Andrew Thornton
minimum of cmd_per_lun and the virtqueue size'.
2.1. Does this mean that devices can reject concurrent commands
above min(cmd_per_lun, virtqueue_size)?

Yes (with TASK SET FULL status). Though if virtqueue_size < cmd_per_lun,
the driver actually won't have room to queue more than virtqueue_size items.
2.2. Do you really mean 'virtqueue_size'? At minimum a command

Post by Andrew Thornton
requires at least 2 entries in the virtqueue. Should this
minimum be virtqueue_size / 2?

Not if you use indirect descriptors.
Device operation: requestq

Post by Andrew Thornton
1. When a transport returns VIRTIO_SCSI_S_BUSY, can we specify that a
guest should retry the request? This would simplify device
implementations in the face of resource limitations and would
allow guests to control I/O queueing.

That usually makes sense, but it does not have to be that way. For
example, under Linux you can mark a request as "failfast" and avoid the
retry.
2. When a target is hotunplugged with I/O inflight, can we specify which

Post by Andrew Thornton
error response will be returned for the now-terminated I/Os?

Either I/O is completed, or it is already documented to be ILLEGAL
REQUEST/LOGICAL UNIT NOT SUPPORTED.
Device operation: controlq

Post by Andrew Thornton
The ordering of Task Management Function completion with
respect to requests they are acting on is unspecified. However
SCSI midlayers require TMF commands complete _after_ the command(s)
they are aborting/reseting.

I and Venkatesh sorted this out on the upstream linux-scsi mailing list
for the abort case.
The ordering of completing TMFs vs. requests are now documented, but the
Linux driver messed up this case.
This requires a device ensure ordering between the controlq and

Post by Andrew Thornton
requestq processing; for TMF RESET, this means a reset must
drain all the request queues (searching for undispatched
commands; QEMU does not do this currently and can corrupt guest
memory in the worst case).

I think this cannot happen on QEMU if the commands are undispatched _and_
the doorbell register has been written, since QEMU is basically
single-threaded. If the doorbell register has not been written to, the
driver is probably buggy (sending a reset and a command at the same time is
probably not a good idea).
1. If we could have a feature flag (VIRTIO_SCSI_F_TMF_ON_REQUESTQ)

Post by Andrew Thornton
that allowed TMF commands to be sent down the requestqueue,ordering
would be naturally enforced and devices would save a lot of complexity.

This would be too late for 1.0. I'm also not convinced it is a good idea,
for if the request queue is full you cannot send TMFs to abort commands.
Also, the virtio-scsi standard does not document how you use multiple
request queues, and multiple request queues would have the same ordering
problems as the separate control queue.
2. If that is not possible, a guest driver can cycle a

Post by Andrew Thornton
no-op command through request queue(s) before aborting/resetting
a command. To do this, we need to codify a safe no-op command.
We could use a command w/ lun[0] = 0x0 as a safe no-op command.
This is currently the case for QEMU, vhost-scsi, and GCE. We would
like to have this formalized.

Paolo Bonzini

2014-06-06 07:54:38 UTC

Post by Paolo Bonzini
As to cmd_per_lun, you could obey cmd_per_lun and still get TASK SET
FULL responses from the target if the host or other guests are using it
at the same time. Perhaps the hypervisor could change that to BUSY
(again, not VIRTIO_SCSI_S_BUSY), but this is again a generic SCSI target
implementation issue, not specific to virtio-scsi.

Ok, then that would be the generic VIRTIO_SCSI_S_FAILURE (DID_ERROR).

Post by Andrew Thornton
2. When a target is hotunplugged with I/O inflight, can we specify which
error response will be returned for the now-terminated I/Os?

Either I/O is completed, or it is already documented to be ILLEGAL
REQUEST/LOGICAL UNIT NOT SUPPORTED.

When an I/O is already running against a target and it is hotremoved,
ILLEGAL REQUEST/LOGICAL UNIT NOT SUPPORTED is not a correct SCSI response.
That response would be appropriate if the I/O reached the target after it
was removed, but not for the case where the command was already running.
Also, consider long-running operations -- once a guest has already seen
them running on a target (ex: long running operation in progress), it would
not be correct to return ILLEGAL REQUEST/LOGICAL UNIT NOT SUPPORTED.
We would like to see a transport-layer response here, for example
VIRTIO_SCSI_S_TRANSPORT_FAILURE. Hot-unplugging a disk is analogous to
breaking the link to the target and Linux will reflect this as a
DID_TRANSPORT_DISRUPTED host byte.

I see what you mean. I agree that VIRTIO_SCSI_S_TRANSPORT_FAILURE is fine.

Post by Andrew Thornton
1. REPORT LUNS to the well-known address would work. No device emulation
currently supports it to the well-known address, however.
2. Can we specify that it is not an error (the device will not enter a
broken state) to send a command with either an invalid LUN or a command
that is too short? (ex: a 1 byte command?). Even an invalid command
would serve as a fine no-op, if it doesn't wedge the device.

Hmm, no, I'd be a bit wary of doing that.

Regarding the failure codes, I think they're a fairly obvious match.
Since there is no "implementation hints" section, I'd rather defer the
change past 1.0.

I guess we could still do the well-known LUN change if I get a review.

Paolo

This publicly archived list offers a means to provide input to the

OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and

to minimize spam in the list archive, subscription is required

before posting.

Subscribe: virtio-comment-***@lists.oasis-open.org

Unsubscribe: virtio-comment-***@lists.oasis-open.org

List help: virtio-comment-***@lists.oasis-open.org

List archive: https://lists.oasis-open.org/archives/virtio-comment/

Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf

List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists

Committee: https://www.oasis-open.org/committees/virtio/

Join OASIS: https://www.oasis-open.org/join/

Rusty Russell

2014-06-06 03:13:13 UTC

Post by Andrew Thornton
Hi Rusty,
I send a couple of replies that seems to have gone astray.
google.com account so I'd be grateful if you could confirm receipt of this
mail.

Sure. List is also publicly archived:

https://lists.oasis-open.org/archives/virtio-comment/

Post by Andrew Thornton
VIRTIO-106 - This is the reply I sent a couple of weeks ago that seems to

'Configuration'

Device configuration layout

1. max_sectors and cmd_per_lun are described as 'hints'
1.1. Can these become hard limits rather than 'hints'? (IE
devices can reject commands above the cmd_per_lun limit
or the max_sectors limit). If so, can we select a specific
error to return in that case?
2. cmd_per_lun describes 'the actual value to be used is the
minimum of cmd_per_lun and the virtqueue size'.
2.1. Does this mean that devices can reject concurrent commands
above min(cmd_per_lun, virtqueue_size)?
2.2. Do you really mean 'virtqueue_size'? At minimum a command
requires at least 2 entries in the virtqueue. Should this
minimum be virtqueue_size / 2?

Indirect descriptors allow this.

The virtqueue_size limit is more a "note: you can't have more than
virtqueue_size descriptors outstanding anyway", rather than something
which needs to be spelled out IMHO.

Leaving the rest to the experts (Paolo cc'd).

Thanks,
Rusty.

Device operation: requestq

1. When a transport returns VIRTIO_SCSI_S_BUSY, can we specify that a
guest should retry the request? This would simplify device
implementations
in the face of resource limitations and would allow guests to control
I/O queueing.
2. When a target is hotunplugged with I/O inflight, can we specify which
error
response will be returned for the now-terminated I/Os?

Device operation: controlq

The ordering of Task Management Function completion with
respect to requests they are acting on is unspecified. However
SCSI midlayers require TMF commands complete _after_ the command(s)
they are aborting/reseting.
requestq controlq
1: REQUEST A
2: TMF ABORT A
3: COMPLETE A
(S_ABORTED)
4: COMPLETION TMF ABORT A
[good ordering]
requestq controlq
1: REQUEST A
2: TMF ABORT A
3: COMPLETION TMF ABORT A
4: COMPLETION A
[bad ordering! surprise completion may corrupt guest memory]
This requires a device ensure ordering between the controlq and
requestq processing; for TMF RESET, this means a reset must
drain all the request queues (searching for undispatched
commands; QEMU does not do this currently and can corrupt guest
memory in the worst case).
1. If we could have a feature flag (VIRTIO_SCSI_F_TMF_ON_REQUESTQ)
that allowed TMF commands to be sent down the requestqueue,ordering
would be naturally enforced and devices would save a lot of complexity.
2. If that is not possible, a guest driver can cycle a
no-op command through request queue(s) before aborting/resetting
a command. To do this, we need to codify a safe no-op command.
We could use a command w/ lun[0] = 0x0 as a safe no-op command.
This is currently the case for QEMU, vhost-scsi, and GCE. We would
like to have this formalized.
Thanks,
Andy

This publicly archived list offers a means to provide input to the

OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and

to minimize spam in the list archive, subscription is required

before posting.

Subscribe: virtio-comment-***@lists.oasis-open.org

Unsubscribe: virtio-comment-***@lists.oasis-open.org

List help: virtio-comment-***@lists.oasis-open.org

List archive: https://lists.oasis-open.org/archives/virtio-comment/

Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf

List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists

Committee: https://www.oasis-open.org/committees/virtio/

Join OASIS: https://www.oasis-open.org/join/

Paolo Bonzini

2014-06-05 09:42:16 UTC

'Configuration'

Device configuration layout

First of all, I think it's not necessary to select an error for these
cases. These issues are not specific to virtio-scsi and the command
will succeed at the virtio-scsi level; for cmd_per_lun the the status
could be BUSY (not VIRTIO_SCSI_S_BUSY!) or TASK SET FULL, for transfer
length the SCSI standard says you get INVALID FIELD IN CDB. These
status or sense codes are defined in the appropriate SCSI standards.

The configuration limits are imposed by the hypervisor, so transfer
lengths or queue depths higher than the values in the configuration
should cause an error. The reason these are hints is because the issue
is quite complex if you do SCSI passthrough, and in that case a transfer
length or queue depth lower than the limit could trigger an error.

For example, each target or LUN could actually have its own transfer
limit, that is lower than max_sectors. In this case the initiator
should look for the block limits VPD page anyway.

As to cmd_per_lun, you could obey cmd_per_lun and still get TASK SET
FULL responses from the target if the host or other guests are using it
at the same time. Perhaps the hypervisor could change that to BUSY
(again, not VIRTIO_SCSI_S_BUSY), but this is again a generic SCSI target
implementation issue, not specific to virtio-scsi.

Post by Andrew Thornton
2. cmd_per_lun describes 'the actual value to be used is the
minimum of cmd_per_lun and the virtqueue size'.
2.1. Does this mean that devices can reject concurrent commands
above min(cmd_per_lun, virtqueue_size)?

Yes (with TASK SET FULL status). Though if virtqueue_size <
cmd_per_lun, the driver actually won't have room to queue more than
virtqueue_size items.

Post by Andrew Thornton
2.2. Do you really mean 'virtqueue_size'? At minimum a command
requires at least 2 entries in the virtqueue. Should this
minimum be virtqueue_size / 2?

Not if you use indirect descriptors.

Device operation: requestq

That usually makes sense, but it does not have to be that way. For
example, under Linux you can mark a request as "failfast" and avoid the
retry.

Post by Andrew Thornton
2. When a target is hotunplugged with I/O inflight, can we specify which
error response will be returned for the now-terminated I/Os?

Either I/O is completed, or it is already documented to be ILLEGAL
REQUEST/LOGICAL UNIT NOT SUPPORTED.