]> git.immae.eu Git - github/fretlink/tap-google-sheets.git/blob - README.md
Add PR template
[github/fretlink/tap-google-sheets.git] / README.md
1 # tap-google-sheets
2
3 This is a [Singer](https://singer.io) tap that produces JSON-formatted data
4 following the [Singer
5 spec](https://github.com/singer-io/getting-started/blob/master/SPEC.md).
6
7 This tap:
8
9 - Pulls raw data from the [Google Sheets v4 API](https://developers.google.com/sheets/api)
10 - Extracts the following endpoints:
11 - [File Metadata](https://developers.google.com/drive/api/v3/reference/files/get)
12 - [Spreadsheet Metadata](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/get)
13 - [Spreadsheet Values](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get)
14 - For each Sheet:
15 - Outputs the schema for each resource (based on the column header and datatypes of first row of data)
16 - Outputs a record for all columns with column headers, and for each row of data until it reaches an empty row
17
18 ## API Endpoints
19 [**file (GET)**](https://developers.google.com/drive/api/v3/reference/files/get)
20 - Endpoint: https://www.googleapis.com/drive/v3/files/${spreadsheet_id}?fields=id,name,createdTime,modifiedTime,version
21 - Primary keys: id
22 - Replication strategy: Full (GET file audit data for spreadsheet_id in config)
23 - Process/Transformations: Replicate Data if Modified
24
25 [**metadata (GET)**](https://developers.google.com/drive/api/v3/reference/files/get)
26 - Endpoint: https://sheets.googleapis.com/v4/spreadsheets/${spreadsheet_id}?includeGridData=true&ranges=1:2
27 - This endpoint eturns spreadsheet metadata, sheet metadata, and value metadata (data type information)
28 - Primary keys: spreadsheetId, title, field_name
29 - Foreign keys: None
30 - Replication strategy: Full (get and replace file metadata for spreadshee_id in config)
31 - Process/Transformations:
32 - Verify Sheets: Check sheets exist (compared to catalog) and check gridProperties (available area)
33 - sheetId, title, index, gridProperties (rowCount, columnCount)
34 - Verify Field Headers (1st row): Check field headers exist (compared to catalog), missing headers (columns to skip), column order/position, and column uniqueness
35 - Header's field_name, position: data.rowData[0].values[i].formattedValue
36 - Create/Verify Datatypes (2nd row):
37 - Row 2's datatype, format: data.rowData[1].values[i]
38 - First check:
39 - [effectiveValue: key](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/other#ExtendedValue)
40 - Valid types: numberValue, stringValue, boolValue
41 - Invalid types: formulaValue, errorValue
42 - Then check:
43 - [effectiveFormat.numberFormat.type](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/cells#NumberFormatType)
44 - Valid types: UNEPECIFIED, TEXT, NUMBER, PERCENT, CURRENCY, DATE, TIME, DATE_TIME, SCIENTIFIC
45 - If DATE or DATE_TIME, set JSON schema datatype = string and format = date-time
46 - [effectiveFormat.numberFormat.pattern](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets/cells#NumberFormat)
47
48 [**values (GET)**](https://developers.google.com/sheets/api/reference/rest/v4/spreadsheets.values/get)
49 - Endpoint: https://sheets.googleapis.com/v4/spreadsheets/${spreadsheet_id}/values/'${sheet_name}'!${row_range}?dateTimeRenderOption=SERIAL_NUMBER&valueRenderOption=UNFORMATTED_VALUE&majorDimension=ROWS
50 - This endpoint loops through sheets and row ranges to get the [unformatted values](https://developers.google.com/sheets/api/reference/rest/v4/ValueRenderOption) (effective values only), dates and datetimes as [serial numbers](https://developers.google.com/sheets/api/reference/rest/v4/DateTimeRenderOption)
51 - Primary keys: row
52 - Replication strategy: Full (GET file audit data for spreadsheet_id in config)
53 - Process/Transformations:
54 - Loop through sheets (compared to catalog selection)
55 - Send metadata for sheet
56 - Loop through ranges of rows until reaching empty row or area max row (from sheet metadata)
57 - Transform values, if necessary (dates, date-times, boolean, integer, numers)
58 - Process/send records
59
60 ## Authentication
61 The [**Google Sheets Setup & Authentication**](https://drive.google.com/open?id=1FojlvtLwS0-BzGS37R0jEXtwSHqSiO1Uw-7RKQQO-C4) Google Doc provides instructions show how to configure the Google Cloud API credentials to enable Google Drive and Google Sheets APIs, configure Google Cloud to authorize/verify your domain ownership, generate an API key (client_id, client_secret), authenticate and generate a refresh_token, and prepare your tap config.json with the necessary parameters.
62 - Enable Googe Drive APIs and Authorization Scope: https://www.googleapis.com/auth/drive.metadata.readonly
63 - Enable Google Sheets API and Authorization Scope: https://www.googleapis.com/auth/spreadsheets.readonly
64 - Tap config.json parameters:
65 - client_id: identifies your application
66 - client_secret: authenticates your application
67 - refresh_token: generates an access token to authorize your session
68 - spreadsheet_id: unique identifier for each spreadsheet in Google Drive
69 - start_date: absolute minimum start date to check file modified
70 - user_agent: tap-name and email address; identifies your application in the Remote API server logs
71
72 ## Quick Start
73
74 1. Install
75
76 Clone this repository, and then install using setup.py. We recommend using a virtualenv:
77
78 ```bash
79 > virtualenv -p python3 venv
80 > source venv/bin/activate
81 > python setup.py install
82 OR
83 > cd .../tap-google-sheets
84 > pip install .
85 ```
86 2. Dependent libraries
87 The following dependent libraries were installed.
88 ```bash
89 > pip install target-json
90 > pip install target-stitch
91 > pip install singer-tools
92 > pip install singer-python
93 ```
94 - [singer-tools](https://github.com/singer-io/singer-tools)
95 - [target-stitch](https://github.com/singer-io/target-stitch)
96
97 3. Create your tap's `config.json` file. Include the client_id, client_secret, refresh_token, site_urls (website URL properties in a comma delimited list; do not include the domain-level property in the list), start_date (UTC format), and user_agent (tap name with the api user email address).
98
99 ```json
100 {
101 "client_id": "YOUR_CLIENT_ID",
102 "client_secret": "YOUR_CLIENT_SECRET",
103 "refresh_token": "YOUR_REFRESH_TOKEN",
104 "spreadsheet_id": "YOUR_GOOGLE_SPREADSHEET_ID",
105 "start_date": "2019-01-01T00:00:00Z",
106 "user_agent": "tap-google-sheets <api_user_email@example.com>"
107 }
108 ```
109
110 Optionally, also create a `state.json` file. `currently_syncing` is an optional attribute used for identifying the last object to be synced in case the job is interrupted mid-stream. The next run would begin where the last job left off.
111 Only the `performance_reports` uses a bookmark. The date-time bookmark is stored in a nested structure based on the endpoint, site, and sub_type.
112
113 ```json
114 {
115 "currently_syncing": "file_metadata",
116 "bookmarks": {
117 "file_metadata": "2019-09-27T22:34:39.000000Z"
118 }
119 }
120 ```
121
122 4. Run the Tap in Discovery Mode
123 This creates a catalog.json for selecting objects/fields to integrate:
124 ```bash
125 tap-google-sheets --config config.json --discover > catalog.json
126 ```
127 See the Singer docs on discovery mode
128 [here](https://github.com/singer-io/getting-started/blob/master/docs/DISCOVERY_MODE.md#discovery-mode).
129
130 5. Run the Tap in Sync Mode (with catalog) and [write out to state file](https://github.com/singer-io/getting-started/blob/master/docs/RUNNING_AND_DEVELOPING.md#running-a-singer-tap-with-a-singer-target)
131
132 For Sync mode:
133 ```bash
134 > tap-google-sheets --config tap_config.json --catalog catalog.json > state.json
135 > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
136 ```
137 To load to json files to verify outputs:
138 ```bash
139 > tap-google-sheets --config tap_config.json --catalog catalog.json | target-json > state.json
140 > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
141 ```
142 To pseudo-load to [Stitch Import API](https://github.com/singer-io/target-stitch) with dry run:
143 ```bash
144 > tap-google-sheets --config tap_config.json --catalog catalog.json | target-stitch --config target_config.json --dry-run > state.json
145 > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
146 ```
147
148 6. Test the Tap
149
150 While developing the Google Search Console tap, the following utilities were run in accordance with Singer.io best practices:
151 Pylint to improve [code quality](https://github.com/singer-io/getting-started/blob/master/docs/BEST_PRACTICES.md#code-quality):
152 ```bash
153 > pylint tap_google_sheets -d missing-docstring -d logging-format-interpolation -d too-many-locals -d too-many-arguments
154 ```
155 Pylint test resulted in the following score:
156 ```bash
157 Your code has been rated at 9.78/10
158 ```
159
160 To [check the tap](https://github.com/singer-io/singer-tools#singer-check-tap) and verify working:
161 ```bash
162 > tap-google-sheets --config tap_config.json --catalog catalog.json | singer-check-tap > state.json
163 > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
164 ```
165 Check tap resulted in the following:
166 ```bash
167 The output is valid.
168 It contained 3881 messages for 13 streams.
169
170 13 schema messages
171 3841 record messages
172 27 state messages
173
174 Details by stream:
175 +----------------------+---------+---------+
176 | stream | records | schemas |
177 +----------------------+---------+---------+
178 | file_metadata | 1 | 1 |
179 | spreadsheet_metadata | 1 | 1 |
180 | Test-1 | 9 | 1 |
181 | Test 2 | 2 | 1 |
182 | SKU COGS | 218 | 1 |
183 | Item Master | 216 | 1 |
184 | Retail Price | 273 | 1 |
185 | Retail Price NEW | 284 | 1 |
186 | Forecast Scenarios | 2681 | 1 |
187 | Promo Type | 91 | 1 |
188 | Shipping Method | 47 | 1 |
189 | sheet_metadata | 9 | 1 |
190 | sheets_loaded | 9 | 1 |
191 +----------------------+---------+---------+
192 ```
193 ---
194
195 Copyright &copy; 2019 Stitch