Module core.steps.data.bq_data_step¶
Base interface for BQ Data Step
Functions¶
ReadFromBigQuery(pipeline: apache_beam.pipeline.Pipeline, query_project: str, query_dataset: str, query_table: str, gcs_location: str, dest_project: str, query_limit: int = None) ‑> apache_beam.pvalue.PCollection
: The Beam PTransform used to read data from a specific BQ table.
Args:
pipeline: Input beam.Pipeline object coming from a TFX Executor.
query_project: Google Cloud project where the target table is
located.
query_dataset: Name of the BigQuery dataset where the target table is
located.
query_table: Name of the target BigQuery table.
gcs_location: Google Cloud Storage location, as a string, where the
extracted table should be written.
dest_project: Additional Google Cloud Project identifier.
query_limit: Optional maximum number of data points to read from
the specified BQ table.
Returns:
A beam.PCollection of data points. Each row in the BigQuery table
represents a single data point.
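The `query_limit` argument effectively constrains the SQL issued against the table. As a rough sketch of that behavior (the helper name `build_bq_query` and the exact SQL shape are assumptions for illustration, not the module's actual internals):

```python
def build_bq_query(query_project: str, query_dataset: str,
                   query_table: str, query_limit: int = None) -> str:
    """Hypothetical sketch of the SQL a read like this might issue.

    The real implementation may differ; this only illustrates how an
    optional limit constrains the number of rows read.
    """
    query = f"SELECT * FROM `{query_project}.{query_dataset}.{query_table}`"
    if query_limit is not None:
        # Cap the number of rows returned from the table.
        query += f" LIMIT {query_limit}"
    return query


# Unlimited read vs. a capped read:
full_query = build_bq_query("my-project", "my_dataset", "my_table")
capped_query = build_bq_query("my-project", "my_dataset", "my_table", 1000)
```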
Classes¶
BQDataStep(query_project: str, query_dataset: str, query_table: str, gcs_location: str, dest_project: str = None, query_limit: int = None, schema: Dict = None)
: A step that reads in data from a Google BigQuery table supplied on
construction.
BigQuery (BQ) data step constructor. Targets a single BigQuery table
within a public or private project and dataset. In order to use
private BQ tables in your pipelines, you may need to set the
GOOGLE_APPLICATION_CREDENTIALS environment variable within your code,
e.g.:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/file.json"
pointing to a valid service account key file that has the necessary
permissions for your project and dataset.
Args:
query_project: Google Cloud project where the target table
is located.
query_dataset: Name of the BigQuery dataset where the target table
is located.
query_table: Name of the target BigQuery table.
gcs_location: Google Cloud Storage location, as a string, where the
extracted table should be written.
dest_project: Additional Google Cloud Project identifier.
query_limit: Optional maximum number of data points to read from
the specified BQ table.
schema: Optional schema providing data type information about
the data source.
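Putting the constructor notes together, a usage sketch might look as follows. Setting the credentials variable is taken from the docstring above; the project, dataset, table, and bucket names are placeholders, and the `BQDataStep` call is shown commented out since it requires a live GCP setup:

```python
import os

# Point Google client libraries at a service account key file with the
# required BigQuery and GCS permissions. The path is a placeholder --
# substitute the location of your own key file.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/file.json"

# Hypothetical step construction, using the parameter names documented
# above (requires a real GCP project, dataset, table, and bucket):
# from zenml.core.steps.data.bq_data_step import BQDataStep
#
# data_step = BQDataStep(
#     query_project="my-gcp-project",
#     query_dataset="my_dataset",
#     query_table="my_table",
#     gcs_location="gs://my-bucket/bq-extract",
#     query_limit=1000,
# )
```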
### Ancestors (in MRO)
* zenml.core.steps.data.base_data_step.BaseDataStep
* zenml.core.steps.base_step.BaseStep
### Methods
`read_from_source(self)`
:   Reads data from the BigQuery table configured on construction.