Begin Viewing Query Results Before Query Ends

2020-04-16 04:56发布

问题:

Lets say I query a table with 500K rows. I would like to begin viewing any rows in the fetch buffer, which holds the result set, even though the query has not yet completed. I would like to scroll thru the fetch buffer. If I scroll too far ahead, I want to display a message like: "REACHED LAST ROW IN FETCH BUFFER.. QUERY HAS NOT YET COMPLETED".

  • Could this be accomplished using fgets() to read the fetch buffer while the query continues building the result set? Doing this implies multi-threading*

Can a feature like this, other than the FIRST ROWS hint directive, be provided in Oracle, Informix, MySQL, or other RDBMS?

The whole idea is to have the ability to start viewing rows before a long query completes, while displaying a counter of how many rows are available for immediate viewing.

EDIT: What I'm suggesting may require a fundamental change in a DB server's architecture, as to the way they handle their internal fetch buffers, e.g. locking up the result set until the query has completed, etc. A feature like the one I am suggesting would be very useful, especially for queries which take a long time to complete. Why have to wait until the whole query completes, when you could start viewing some of the results while the query continues to gather more results!

回答1:

There are three basic limiting factors:

  1. The execution plan of the query. If the execution plan has a blocking operation at the end (such as a sort or an eager spool), the engine cannot return rows early in the query execution. It must wait until all rows are fully processed, after which it will return the data as fast as possible to the client. The time for this may itself be appreciable, so this part could be applicable to what you're talking about. In general, though, you cannot guarantee that a query will have much available very soon.

  2. The database connection library. When returning recordsets from a database, the driver can use server-side paging or client-side paging. Which is used can and does affect which rows will be returned and when. Client-side paging forces the entire query to be returned at once, reducing the opportunity for displaying any data before it is all in. Careful use of the proper paging method is crucial to any chance to display data early in a query's lifetime.

  3. The client program's use of synchronous or asynchronous methods. If you simply copy and paste some web example code for executing a query, you will be a bit less likely to be working with early results while the query is still running—instead the method will block and you will get nothing until it is all in. Of course, server-side paging (see point #2) can alleviate this, however in any case your application will be blocked for at least a short time if you do not specifically use an asynchronous method. For anyone reading this who is using .Net, you may want to check out Asynchronous Operations in .Net Framework.

If you get all of these right, and use the FAST FIRSTROW technique, you may be able to do some of what you're looking for. But there is no guarantee.



回答2:

Paraphrasing:

I have a table with 500K rows. An ad-hoc query without a good index to support it requires a full table scan. I would like to immediately view the first rows returned while the full table scan continues. Then I want to scroll through the next results.

It seems that what you would like is some sort of system where there can be two (or more) threads at work. One thread would be busy synchronously fetching the data from the database, and reporting its progress to the rest of the program. The other thread would be dealing with the display.

In the meantime, I would like to display the progress of the table scan, example: "Searching...found 23 of 500,000 rows so far".

It isn't clear that your query will return 500,000 rows (indeed, let us hope it does not), though it may have to scan all 500,000 rows (and may well have only found 23 rows that match so far). Determining the number of rows to be returned is hard; determining the number of rows to be scanned is easier; determining the number of rows already scanned is very difficult.

If I scroll too far ahead, I want to display a message like: "Reached last row in look-ahead buffer...query has not completed yet".

So, the user has scrolled past the 23rd row, but the query is not yet completed.

Can this be done? Maybe like: spawn/exec, declare scroll cursor, open, fetch, etc.?

There are a couple of issues here. The DBMS (true of most databases, and certainly of IDS) remains tied up as far as the current connection on processing the one statement. Obtaining feedback on how a query has progressed is difficult. You could look at the estimated rows returned when the query was started (information in the SQLCA structure), but those values are apt to be wrong. You'd have to decide what to do when you reach row 200 of 23, or you only get to row 23 of 5,697. It is better than nothing, but it is not reliable. Determining how far a query has progressed is very difficult. And some queries require an actual sort operation, which means that it is very hard to predict how long it will take because no data is available until the sort is done (and once the sort is done, there is only the time taken to communicate between the DBMS and the application to hold up the delivery of the data).

Informix 4GL has many virtues, but thread support is not one of them. The language was not designed with thread safety in mind, and there is no easy way to retrofit it into the product.

I do think that what you are seeking would be most easily supported by two threads. In a single-threaded program like an I4GL program, there isn't an easy way to go off and fetch rows while waiting for the user to type some more input (such as 'scroll down the next page full of data').

The FIRST ROWS optimization is a hint to the DBMS; it may or may not give a significant benefit to the perceived performance. Overall, it typically means that the query is processed less optimally from the DBMS perspective, but getting results to the user quickly can be more important than the workload on the DBMS.


Somewhere down below in a much down-voted answer, Frank shouted (but please don't SHOUT):

That's exactly what I want to do, spawn a new process to begin displaying first_rows and scroll through them even though the query has not completed.

OK. The difficulty here is organizing the IPC between the two client-side processes. If both are connected to the DBMS, they have separate connections, and therefore the temporary tables and cursors of one session are not available to the other.

When a query is executed, a temporary table is created to hold the query results for the current list. Does the IDS engine place an exclusive lock on this temp table until the query completes?

Not all queries result in a temporary table, though the result set for a scroll cursor usually does have something approximately equivalent to a temporary table. IDS does not need to place a lock on the temporary table backing a scroll cursor because only IDS can access the table. If it was a regular temp table, there'd still not be a need to lock it because it cannot be accessed except by the session that created it.

What I meant with the 500k rows, is nrows in the queried table, not how many expected results will be returned.

Maybe a more accurate status message would be:

Searching 500,000 rows...found 23 matching rows so far

I understand that an accurate count of nrows can be obtained in sysmaster:sysactptnhdr.nrows?

Probably; you can also get a fast and accurate count with 'SELECT COUNT(*) FROM TheTable'; this does not scan anything but simply accesses the control data - probably effectively the same data as in the nrows column of the SMI table sysmaster:sysactptnhdr.

So, spawning a new process is not clearly a recipe for success; you have to transfer the query results from the spawned process to the original process. As I stated, a multithreaded solution with separate display and database access threads would work after a fashion, but there are issues with doing this using I4GL because it is not thread-aware. You'd still have to decide how the client-side code is going store the information for display.



回答3:

It can be done, with an analytic function, but Oracle has to full scan the table to determine the count no matter what you do if there's no index. An analytic could simplify your query:

SELECT x,y,z, count(*) over () the_count
  FROM your_table
 WHERE ...

Each row returned will have the total count of rows returned by the query in the_count. As I said, however, Oracle will have to finish the query to determine the count before anything is returned.

Depending on how you're processing the query (e.g., a PL/SQL block in a form), you could use the above query to open a cursor, then loop through the cursor and display sets of records and give the user the chance to cancel.



回答4:

I'm not sure how you would accomplish this, since the query has to complete prior to the results being known. No RDBMS (that I know of) offers any means of determining how many results to a query have been found prior to the query completing.

I can't speak factually for how expensive such a feature would be in Oracle because I have never seen the source code. From the outside in, however, I think it would be rather costly and could double (if not more) the length of time a query took to complete. It would mean updating an atomic counter after each result, which isn't cheap when you're talking millions of possible rows.



回答5:

So I am putting up my comments into this answer- In terms of Oracle.

Oracle maintains its own buffer cache inside the system global area (SGA) for each instance. The hit ratio on the buffer cache depends on the sizing and reaches 90% most of the time, which means 9 out of 10 hits will be satisfied without reaching the disk.

Considering the above, even if there is a "way" (so to speak) to access the buffer chache for a query you run, the results would highly depend on the cache sizing factor. If a buffer cache is too small, the cache hit ratio will be small and more physical disk I/O will result, which will render the buffer cache un-reliable in terms of temp-data content. If a buffer cache is too big, then parts of the buffer cache will be under-utilized and memory resources will be wasted, which in terms would render too much un-necessary processing trying to access the buffer cache while in order to peek in it for the data you want.

Also depending on your cache sizing and SGA memory it would be upto the ODBC driver / optimizer to determine when and how much to use what (cache buffering or Direct Disk I/O).

In terms of trying to access the "buffer cache" to find "the row" you are looking for, there might be a way (or in near future) to do it, but there would be no way to know if what you are looking for ("The row") is there or not after all.

Also, full table scans of large tables usually result in physical disk reads and a lower buffer cache hit ratio.You can get an idea of full table scan activity at the data file level by querying v$filestat and joining to SYS.dba_data_files. Following is a query you can use and sample results:

 SELECT   A.file_name, B.phyrds, B.phyblkrd
 FROM     SYS.dba_data_files A, v$filestat B
 WHERE    B.file# = A.file_id
 ORDER BY A.file_id;

Since this whole ordeal is highly based on multiple parameters and statistics, the results of what you are looking for may remain a probability driven off of those facotrs.