diff options
author | Jeff Huth <jeff.huth@bytecode.io> | 2019-11-13 17:03:56 -0800 |
---|---|---|
committer | Jeff Huth <jeff.huth@bytecode.io> | 2019-11-13 17:03:56 -0800 |
commit | 89643ba6fa98db82efd3246805ef801a8bfb5c81 (patch) | |
tree | 739027b4e827def2db81631c9d6ed58ec2b97809 /README.md | |
parent | 5f8005471d3affaaf23489df93a58ca64c3da3ca (diff) | |
download | tap-google-sheets-89643ba6fa98db82efd3246805ef801a8bfb5c81.tar.gz tap-google-sheets-89643ba6fa98db82efd3246805ef801a8bfb5c81.tar.zst tap-google-sheets-89643ba6fa98db82efd3246805ef801a8bfb5c81.zip |
Initial commit
Discovery mode works. Still working on normal sync.
Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 173 |
1 files changed, 172 insertions, 1 deletions
@@ -1 +1,172 @@ | |||
1 | tap-google-sheets | 1 | # tap-google-sheets |
2 | |||
3 | This is a [Singer](https://singer.io) tap that produces JSON-formatted data | ||
4 | following the [Singer | ||
5 | spec](https://github.com/singer-io/getting-started/blob/master/SPEC.md). | ||
6 | |||
7 | This tap: | ||
8 | |||
9 | - Pulls raw data from the [Google Sheets v4 API](https://developers.google.com/sheets/api) | ||
10 | - Extracts the following endpoints: | ||
11 | - [Spreadsheet File](https://developers.google.com/drive/api/v3/reference/files/get) | ||
12 | - [Spreadsheet Metadata, Sheets, Headers](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/get) | ||
13 | - [Spreadsheet Values](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get) | ||
14 | - For each Sheet: | ||
15 | - Outputs the schema for each resource (based on the column header and datatypes of first row of data) | ||
16 | - Outputs a record for all columns with column headers, and for each row of data until it reaches an empty row | ||
17 | |||
18 | ## API Endpoints | ||
19 | [**file (GET)**](https://developers.google.com/drive/api/v3/reference/files/get) | ||
20 | - Endpoint: https://www.googleapis.com/drive/v3/files/${spreadsheet_id}?fields=id,name,createdTime,modifiedTime,version | ||
21 | - Primary keys: id | ||
22 | - Replication strategy: Full (GET file audit data for spreadsheet_id in config) | ||
23 | - Process/Transformations: Replicate Data if Modified | ||
24 | |||
25 | [**metadata (GET)**](https://developers.google.com/drive/api/v3/reference/files/get) | ||
26 | - Endpoint: https://sheets.googleapis.com/v4/spreadsheets/${spreadsheet_id}?includeGridData=true&ranges=1:2 | ||
27 | - This endpoint eturns spreadsheet metadata, sheet metadata, and value metadata (data type information) | ||
28 | - Primary keys: spreadsheetId, title, field_name | ||
29 | - Foreign keys: None | ||
30 | - Replication strategy: Full (get and replace file metadata for spreadshee_id in config) | ||
31 | - Process/Transformations: | ||
32 | - Verify Sheets: Check sheets exist (compared to catalog) and check gridProperties (available area) | ||
33 | - sheetId, title, index, gridProperties (rowCount, columnCount) | ||
34 | - Verify Field Headers (1st row): Check field headers exist (compared to catalog), missing headers (columns to skip), column order/position, and column uniqueness | ||
35 | - Header's field_name, position: data.rowData[0].values[i].formattedValue | ||
36 | - Create/Verify Datatypes (2nd row): | ||
37 | - Row 2's datatype, format: data.rowData[1].values[i] | ||
38 | - First check: | ||
39 | - [effectiveValue: key](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/other#ExtendedValue) | ||
40 | - Valid types: numberValue, stringValue, boolValue | ||
41 | - Invalid types: formulaValue, errorValue | ||
42 | - Then check: | ||
43 | - [effectiveFormat.numberFormat.type](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/cells#NumberFormatType) | ||
44 | - Valid types: UNEPECIFIED, TEXT, NUMBER, PERCENT, CURRENCY, DATE, TIME, DATE_TIME, SCIENTIFIC | ||
45 | - If DATE or DATE_TIME, set JSON schema datatype = string and format = date-time | ||
46 | - [effectiveFormat.numberFormat.pattern](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/cells#NumberFormat) | ||
47 | |||
48 | [**values (GET)**](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get) | ||
49 | - Endpoint: https://sheets.googleapis.com/v4/spreadsheets/${spreadsheet_id}/values/'${sheet_name}'!${row_range}?dateTimeRenderOption=SERIAL_NUMBER&valueRenderOption=UNFORMATTED_VALUE&majorDimension=ROWS | ||
50 | - This endpoint loops through sheets and row ranges to get the [unformatted values](https://developers.google.com/sheets/api/reference/rest/v4/ValueRenderOption) (effective values only), dates and datetimes as [serial numbers](https://developers.google.com/sheets/api/reference/rest/v4/DateTimeRenderOption) | ||
51 | - Primary keys: row | ||
52 | - Replication strategy: Full (GET file audit data for spreadsheet_id in config) | ||
53 | - Process/Transformations: | ||
54 | - Loop through sheets (compared to catalog selection) | ||
55 | - Send metadata for sheet | ||
56 | - Loop through ranges of rows until reaching empty row or area max row (from sheet metadata) | ||
57 | - Transform values, if necessary (dates, date-times, boolean, integer, numers) | ||
58 | - Process/send records | ||
59 | |||
60 | ## Authentication | ||
61 | The [**Google Sheets Setup & Authentication**](https://drive.google.com/open?id=1FojlvtLwS0-BzGS37R0jEXtwSHqSiO1Uw-7RKQQO-C4) Google Doc provides instructions show how to configure the Google Cloud API credentials to enable Google Drive and Google Sheets APIs, configure Google Cloud to authorize/verify your domain ownership, generate an API key (client_id, client_secret), authenticate and generate a refresh_token, and prepare your tap config.json with the necessary parameters. | ||
62 | - Enable Googe Drive APIs and Authorization Scope: https://www.googleapis.com/auth/drive.metadata.readonly | ||
63 | - Enable Google Sheets API and Authorization Scope: https://www.googleapis.com/auth/spreadsheets.readonly | ||
64 | - Tap config.json parameters: | ||
65 | - client_id: identifies your application | ||
66 | - client_secret: authenticates your application | ||
67 | - refresh_token: generates an access token to authorize your session | ||
68 | - spreadsheet_id: unique identifier for each spreadsheet in Google Drive | ||
69 | - start_date: absolute minimum start date to check file modified | ||
70 | - user_agent: tap-name and email address; identifies your application in the Remote API server logs | ||
71 | |||
72 | ## Quick Start | ||
73 | |||
74 | 1. Install | ||
75 | |||
76 | Clone this repository, and then install using setup.py. We recommend using a virtualenv: | ||
77 | |||
78 | ```bash | ||
79 | > virtualenv -p python3 venv | ||
80 | > source venv/bin/activate | ||
81 | > python setup.py install | ||
82 | OR | ||
83 | > cd .../tap-google-sheets | ||
84 | > pip install . | ||
85 | ``` | ||
86 | 2. Dependent libraries | ||
87 | The following dependent libraries were installed. | ||
88 | ```bash | ||
89 | > pip install target-json | ||
90 | > pip install target-stitch | ||
91 | > pip install singer-tools | ||
92 | > pip install singer-python | ||
93 | ``` | ||
94 | - [singer-tools](https://github.com/singer-io/singer-tools) | ||
95 | - [target-stitch](https://github.com/singer-io/target-stitch) | ||
96 | |||
97 | 3. Create your tap's `config.json` file. Include the client_id, client_secret, refresh_token, site_urls (website URL properties in a comma delimited list; do not include the domain-level property in the list), start_date (UTC format), and user_agent (tap name with the api user email address). | ||
98 | |||
99 | ```json | ||
100 | { | ||
101 | "client_id": "YOUR_CLIENT_ID", | ||
102 | "client_secret": "YOUR_CLIENT_SECRET", | ||
103 | "refresh_token": "YOUR_REFRESH_TOKEN", | ||
104 | "spreadsheet_id": "YOUR_GOOGLE_SPREADSHEET_ID", | ||
105 | "start_date": "2019-01-01T00:00:00Z", | ||
106 | "user_agent": "tap-google-sheets <api_user_email@example.com>" | ||
107 | } | ||
108 | ``` | ||
109 | |||
110 | Optionally, also create a `state.json` file. `currently_syncing` is an optional attribute used for identifying the last object to be synced in case the job is interrupted mid-stream. The next run would begin where the last job left off. | ||
111 | Only the `performance_reports` uses a bookmark. The date-time bookmark is stored in a nested structure based on the endpoint, site, and sub_type. | ||
112 | |||
113 | ```json | ||
114 | { | ||
115 | "currently_syncing": "file_metadata", | ||
116 | "bookmarks": { | ||
117 | "file_metadata": "2019-09-27T22:34:39.000000Z", | ||
118 | "sheets_loaded": "2019-09-27T21:55:42.000000Z" | ||
119 | } | ||
120 | } | ||
121 | ``` | ||
122 | |||
123 | 4. Run the Tap in Discovery Mode | ||
124 | This creates a catalog.json for selecting objects/fields to integrate: | ||
125 | ```bash | ||
126 | tap-google-sheets --config config.json --discover > catalog.json | ||
127 | ``` | ||
128 | See the Singer docs on discovery mode | ||
129 | [here](https://github.com/singer-io/getting-started/blob/master/docs/DISCOVERY_MODE.md#discovery-mode). | ||
130 | |||
131 | 5. Run the Tap in Sync Mode (with catalog) and [write out to state file](https://github.com/singer-io/getting-started/blob/master/docs/RUNNING_AND_DEVELOPING.md#running-a-singer-tap-with-a-singer-target) | ||
132 | |||
133 | For Sync mode: | ||
134 | ```bash | ||
135 | > tap-google-sheets --config tap_config.json --catalog catalog.json > state.json | ||
136 | > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json | ||
137 | ``` | ||
138 | To load to json files to verify outputs: | ||
139 | ```bash | ||
140 | > tap-google-sheets --config tap_config.json --catalog catalog.json | target-json > state.json | ||
141 | > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json | ||
142 | ``` | ||
143 | To pseudo-load to [Stitch Import API](https://github.com/singer-io/target-stitch) with dry run: | ||
144 | ```bash | ||
145 | > tap-google-sheets --config tap_config.json --catalog catalog.json | target-stitch --config target_config.json --dry-run > state.json | ||
146 | > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json | ||
147 | ``` | ||
148 | |||
149 | 6. Test the Tap | ||
150 | |||
151 | While developing the Google Search Console tap, the following utilities were run in accordance with Singer.io best practices: | ||
152 | Pylint to improve [code quality](https://github.com/singer-io/getting-started/blob/master/docs/BEST_PRACTICES.md#code-quality): | ||
153 | ```bash | ||
154 | > pylint tap_google_sheets -d missing-docstring -d logging-format-interpolation -d too-many-locals -d too-many-arguments | ||
155 | ``` | ||
156 | Pylint test resulted in the following score: | ||
157 | ```bash | ||
158 | TBD | ||
159 | ``` | ||
160 | |||
161 | To [check the tap](https://github.com/singer-io/singer-tools#singer-check-tap) and verify working: | ||
162 | ```bash | ||
163 | > tap-google-sheets --config tap_config.json --catalog catalog.json | singer-check-tap > state.json | ||
164 | > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json | ||
165 | ``` | ||
166 | Check tap resulted in the following: | ||
167 | ```bash | ||
168 | TBD | ||
169 | ``` | ||
170 | --- | ||
171 | |||
172 | Copyright © 2019 Stitch | ||